reward function
Recently Published Documents

TOTAL DOCUMENTS: 464 (five years: 262)
H-INDEX: 29 (five years: 6)
2022, Vol. 73, pp. 173-208
Author(s): Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, Sheila A. McIlraith

Reinforcement learning (RL) methods usually treat reward functions as black boxes. As such, these methods must extensively interact with the environment in order to discover rewards and optimal policies. In most RL applications, however, users have to program the reward function and, hence, there is the opportunity to make the reward function visible – to show the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies in a more sample-efficient manner. In this paper, we show how to accomplish this idea in two steps. First, we propose reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure. We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning. Experiments on tabular and continuous domains, across different tasks and RL agents, show the benefits of exploiting reward structure with respect to sample efficiency and the quality of resultant policies. Finally, by virtue of being a form of finite state machine, reward machines have the expressive power of a regular language and, as such, support loops, sequences, and conditionals, as well as the expression of temporally extended properties typical of linear temporal logic and non-Markovian reward specification.
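As a rough illustration (not code from the paper), a reward machine can be sketched as a small finite state machine whose transitions fire on high-level propositions and emit rewards. The task, propositions, and state names below are invented for a toy "get coffee, then deliver it" example.

# Minimal reward machine sketch; propositions and rewards are hypothetical.
class RewardMachine:
    def __init__(self, transitions, initial_state, terminal_states):
        # transitions: {(state, proposition): (next_state, reward)}
        self.transitions = transitions
        self.state = initial_state
        self.terminal_states = terminal_states

    def step(self, true_propositions):
        # Advance on the first true proposition that matches an outgoing edge.
        for prop in true_propositions:
            key = (self.state, prop)
            if key in self.transitions:
                self.state, reward = self.transitions[key]
                return reward
        return 0.0  # no matching edge: stay in the same state, zero reward

    def is_done(self):
        return self.state in self.terminal_states

# "Get coffee, then deliver it to the office" as a two-step task.
rm = RewardMachine(
    transitions={("u0", "at_coffee"): ("u1", 0.0),
                 ("u1", "at_office"): ("u2", 1.0)},
    initial_state="u0",
    terminal_states={"u2"},
)
print(rm.step({"at_coffee"}), rm.step({"at_office"}), rm.is_done())

Because the machine's states and edges are visible to the agent, shaping and task decomposition can operate directly on this structure rather than on a black-box reward signal.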


2022, Vol. 8 (1)
Author(s): Tailong Xiao, Jianping Fan, Guihua Zeng

Abstract Parameter estimation is a pivotal task in which quantum technologies can greatly enhance precision. We investigate time-dependent parameter estimation based on deep reinforcement learning, where the noise-free and noisy bounds of parameter estimation are derived from a geometrical perspective. We propose a physically inspired, linear time-correlated control ansatz and a general, well-defined reward function integrated with the derived bounds to accelerate network training for fast generation of quantum control signals. Using the proposed scheme, we validate the performance of time-dependent and time-independent parameter estimation under noise-free and noisy dynamics. In particular, we evaluate the transferability of the scheme when the parameter is shifted from the true value. The simulations showcase the robustness and sample efficiency of the scheme, which achieves state-of-the-art performance. Our work highlights the universality and global optimality of deep reinforcement learning over conventional methods in practical parameter estimation for quantum sensing.
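The abstract does not give the exact form of the bound-integrated reward; a hedged sketch of the general idea, rewarding control signals whose achieved estimation variance approaches a derived precision bound, could look like the following. The function name, inputs, and scaling are assumptions, not the authors' formula.

import math

def bound_integrated_reward(estimator_variance, precision_bound, scale=1.0):
    # Hypothetical reward: close to 1.0 when the achieved variance saturates the
    # derived lower bound, decaying as the gap to the bound grows.
    gap = max(estimator_variance - precision_bound, 0.0)
    return math.exp(-scale * gap)

# Toy usage with made-up numbers.
print(bound_integrated_reward(estimator_variance=0.12, precision_bound=0.10))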


Author(s): Fangjian Li, John R Wagner, Yue Wang

Abstract Inverse reinforcement learning (IRL) has been successfully applied in many robotics and autonomous driving studies without the need to hand-tune a reward function. However, it suffers from safety issues. Compared to reinforcement learning (RL) algorithms, IRL is even more vulnerable to unsafe situations because it can only infer the importance of safety from expert demonstrations. In this paper, we propose a safety-aware adversarial inverse reinforcement learning algorithm (S-AIRL). First, the control barrier function (CBF) is used to guide the training of a safety critic, which leverages knowledge of the system dynamics in the sampling process without training an additional guiding policy. The trained safety critic is then integrated into the discriminator to help distinguish the generated data from expert demonstrations from the standpoint of safety. Finally, to further improve safety awareness, a regulator is introduced into the discriminator's training loss to prevent the recovered reward function from assigning high rewards to risky behaviors. We tested S-AIRL in a highway autonomous driving scenario. Compared to the original AIRL algorithm, at the same level of imitation learning (IL) performance, the proposed S-AIRL reduces the collision rate by 32.6%.
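The paper's exact loss is not reproduced in the abstract; a hedged sketch of how a regulator term might discourage high recovered rewards on risky samples is given below. The function name, the form of the regulator, and the assumption that risky samples are flagged by the CBF-guided safety critic are all illustrative, not the authors' implementation.

import torch

def regulated_discriminator_loss(disc_logits_expert, disc_logits_policy,
                                 recovered_reward_risky, reg_weight=0.1):
    # Hypothetical AIRL-style discriminator loss plus a safety regulator term:
    # the regulator penalizes positive recovered rewards on risky samples.
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    loss_expert = bce(disc_logits_expert, torch.ones_like(disc_logits_expert))
    loss_policy = bce(disc_logits_policy, torch.zeros_like(disc_logits_policy))
    regulator = reg_weight * torch.relu(recovered_reward_risky).mean()
    return loss_expert + loss_policy + regulator

# Toy usage with random tensors standing in for network outputs.
print(regulated_discriminator_loss(torch.randn(8), torch.randn(8), torch.randn(8)))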


2022, pp. 1701-1719
Author(s): Vimaladevi M., Zayaraz G.

The use of software in mission-critical applications imposes greater quality requirements. Quality assurance activities are aimed at ensuring that the software system meets these requirements. Antifragility is a property of software that increases its quality as a result of errors, faults, and attacks. Such antifragile software systems proactively accept errors, learn from them, and rely on a test-driven development methodology. In this article, an innovative approach is proposed that uses fault injection to perform the task of quality assurance. Such a fault injection mechanism makes the software antifragile, so that it gets better as the intensity of such errors increases, up to a point. A software quality game is designed as a two-player game model with stressor and backer entities. The stressor is an error model that injects errors into the software system. The software system acts as the backer and tries to recover from the errors. The backer uses a cheating mechanism by implementing software Learning Hooks (SLH), which learn from the injected errors. This makes the software antifragile and leads to improvement of the code. Moreover, the SLH uses a Q-learning reinforcement learning algorithm with a hybrid reward function to learn from the incoming defects. The game is played for a maximum of K errors. This approach is introduced to incorporate antifragility into the software system within the existing framework of object-oriented development. The game is run at the end of every increment during the construction of object-oriented systems. A detailed report of the injected errors and the actions taken is produced at the end of each increment so that the necessary actions can be incorporated into the actual software during the next iteration. This ensures that, at the end of all iterations, the software is immune to the majority of so-called Black Swans. The experiment is conducted on an open-source Java sample, and the results are studied using two selected categories of evaluation parameters. The defect-related performance parameters considered are the defect density, the defect distribution over different iterations, and the number of hooks inserted; these parameters show a considerable reduction when the proposed approach is adopted. The quality parameters such as abstraction, inheritance, and coupling are studied across the iterations, and the approach yields considerable increases in these parameters.
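The article does not spell out the hybrid reward used by the learning hooks; purely as an illustrative sketch, one plausible shape is a reward that combines recovery success with penalties for error severity and slow recovery. All names and weights below are invented.

def hybrid_reward(recovered, severity, recovery_time,
                  w_recover=1.0, w_severity=0.5, w_time=0.1):
    # Hypothetical hybrid reward for a software Learning Hook (SLH):
    # reward successful recovery, penalize error severity and slow recovery.
    return (w_recover * (1.0 if recovered else -1.0)
            - w_severity * severity
            - w_time * recovery_time)

print(hybrid_reward(recovered=True, severity=0.3, recovery_time=2.0))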


2021
Author(s): Rachit Dubey, Tom Griffiths, Peter Dayan

The pursuit of happiness is not easy. Habituation to positive changes in lifestyle and constant comparisons leave us unhappy even in the best of conditions. Given their disruptive impact, it remains a puzzle why habituation and comparisons have come to be a part of cognition in the first place. Here, we present computational evidence that suggests that these features might play an important role in promoting adaptive behavior. Using the framework of reinforcement learning, we explore the benefit of employing a reward function that, in addition to the reward provided by the underlying task, also depends on prior expectations and relative comparisons. We find that while agents equipped with this reward function are less "happy", they learn faster and significantly outperform standard reward-based agents in a wide range of environments. The fact that these features provide considerable adaptive benefits might explain why we have the propensity to keep wanting more, even if it contributes to depression, materialism, and overconsumption.
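A minimal sketch of the kind of reward the abstract describes, with habituation (prior expectations) and social-comparison terms subtracted from the underlying task reward. The additive form and the weights are assumptions for illustration, not the authors' specification.

def subjective_reward(task_reward, expected_reward, comparison_reward,
                      w_expectation=0.5, w_comparison=0.5):
    # Hypothetical reward: task reward discounted by what was expected
    # (habituation) and by what a reference agent obtained (comparison).
    return (task_reward
            - w_expectation * expected_reward
            - w_comparison * comparison_reward)

# The same objective outcome feels less rewarding once expectations and comparisons rise.
print(subjective_reward(1.0, 0.0, 0.0), subjective_reward(1.0, 0.8, 0.6))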


2021
Author(s): Victor F. Araman, René A. Caldentey

A decision maker (DM) must choose an action in order to maximize a reward function that depends on the DM's action as well as on an unknown parameter Θ. The DM can delay taking the action in order to experiment and gather additional information on Θ. We model the problem using a Bayesian sequential experimentation framework and use dynamic programming and diffusion-asymptotic analysis to solve it. To that end, we consider environments in which the average number of experiments conducted per unit of time is large and the informativeness of each individual experiment is low. Under such regimes, we derive a diffusion approximation for the sequential experimentation problem, which provides a number of important insights about the nature of the problem and its solution. First, it reveals that the problems of (i) selecting the optimal sequence of experiments to use and (ii) deciding the optimal time at which to stop experimenting decouple and can be solved independently. Second, it shows that an optimal experimentation policy is one that chooses the experiment that maximizes the instantaneous volatility of the belief process. Third, the diffusion approximation provides a more mathematically malleable formulation that we can solve in closed form and that suggests efficient heuristics for the nonasymptotic regime. Our solution method also shows that the complexity of the problem grows only quadratically with the cardinality of the set of actions from which the decision maker can choose. We illustrate our methodology and results using a concrete application in the context of assortment selection and new product introduction. Specifically, we study the problem of a seller who wants to select an optimal assortment of products to launch into the marketplace and is uncertain about consumers' preferences. Motivated by emerging practices in e-commerce, we assume that the seller is able to use a crowd voting system to learn these preferences before a final assortment decision is made. In this context, we undertake an extensive numerical analysis to assess the value of learning and to demonstrate the effectiveness and robustness of the heuristics derived from the diffusion approximation. This paper was accepted by Omar Besbes, revenue management and market analytics.
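As an illustrative sketch only (the paper works in a diffusion limit and gives a full analysis), the "choose the experiment that maximizes the instantaneous volatility of the belief process" rule can be written as a one-line selector; the volatility estimates are treated here as given inputs.

def choose_experiment(belief_volatility_by_experiment):
    # Hypothetical selector: run the experiment whose induced belief process
    # has the largest instantaneous volatility (locally most informative).
    return max(belief_volatility_by_experiment,
               key=belief_volatility_by_experiment.get)

# Toy usage with made-up volatility estimates for three candidate experiments.
print(choose_experiment({"A": 0.12, "B": 0.35, "C": 0.20}))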


Author(s): Hao Ji, Yan Jin

Abstract Self-organizing systems (SOS) are developed to perform complex tasks in unforeseen situations with adaptability. Predefining rules for self-organizing agents can be challenging, especially in tasks with high complexity and changing environments. Our previous work introduced a multiagent reinforcement learning (RL) model as a design approach to solving the rule-generation problem of SOS. A deep multiagent RL algorithm was devised to train agents to acquire the task and self-organizing knowledge. However, the simulation was based on one specific task environment. The sensitivity of SOS to reward functions and the systematic evaluation of SOS designed with multiagent RL remain open issues. In this paper, we introduce a rotation reward function to regulate agent behaviors during training and test different weights of this reward on SOS performance in two case studies: box-pushing and T-shape assembly. Additionally, we propose three metrics to evaluate the SOS: learning stability, quality of learned knowledge, and scalability. Results show that, depending on the type of task, designers may choose appropriate rotation-reward weights to realize the full potential of the agents' learning capability. Good learning stability and quality of knowledge can be achieved within an optimal range of team sizes. Scaling up to larger team sizes yields better performance than scaling down.
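A hedged sketch of the weighted rotation term the abstract refers to: the exact shaping used in the paper is not given here, so the additive form below (task reward plus a weighted rotation term) is an assumption for illustration.

def shaped_reward(task_reward, rotation_amount, rotation_weight=0.1):
    # Hypothetical shaping: add a weighted rotation term to the task reward.
    # Depending on the task, the weight may effectively reward or penalize rotation.
    return task_reward + rotation_weight * rotation_amount

# Sweeping the weight is roughly what the two case studies vary.
for w in (0.0, 0.1, 0.5):
    print(w, shaped_reward(task_reward=1.0, rotation_amount=-0.3, rotation_weight=w))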


2021, pp. 1-10
Author(s): Wei Zhou, Xing Jiang, Bingli Guo (Member, IEEE), Lingyu Meng

Currently, Quality-of-Service (QoS)-aware routing is one of the crucial challenges in Software Defined Networking (SDN). QoS metrics, e.g., latency, packet loss ratio, and throughput, must be optimized to improve the performance of the network. Traditional static routing algorithms based on Open Shortest Path First (OSPF) cannot adapt to traffic fluctuations, which may cause severe network congestion and service degradation. The central intelligence of the SDN controller and recent breakthroughs in Deep Reinforcement Learning (DRL) offer a promising way to tackle this challenge. Thus, we propose an on-policy DRL mechanism, namely the PPO-based (Proximal Policy Optimization) QoS-aware Routing Optimization Mechanism (PQROM), to achieve general and re-customizable routing optimization. PQROM can dynamically update the routing calculation by adjusting the reward function according to different optimization objectives, and it is independent of any specific network pattern. Additionally, as a black-box one-step optimization, PQROM handles both continuous and discrete action spaces with high-dimensional input and output. The OMNeT++ simulation results show that PQROM not only converges well but also offers better stability than OSPF, less training time and simpler hyper-parameter adjustment than Deep Deterministic Policy Gradient (DDPG), and lower hardware consumption than Asynchronous Advantage Actor-Critic (A3C).
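PQROM's actual reward is not given in the abstract; a hedged sketch of a re-customizable QoS reward, with adjustable weights over latency, loss, and throughput, might look like the following. The weights, reference values, and normalization are assumptions used only to show how retargeting the objective amounts to changing weights.

def qos_reward(latency_ms, loss_ratio, throughput_mbps,
               w_latency=1.0, w_loss=1.0, w_throughput=1.0,
               latency_ref=100.0, throughput_ref=1000.0):
    # Hypothetical QoS-aware reward: penalize latency and loss, reward throughput.
    # Adjusting the weights re-targets the optimization objective.
    return (- w_latency * (latency_ms / latency_ref)
            - w_loss * loss_ratio
            + w_throughput * (throughput_mbps / throughput_ref))

print(qos_reward(latency_ms=40.0, loss_ratio=0.01, throughput_mbps=600.0))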


2021, Vol. 2021, pp. 1-9
Author(s): Xinmin Li, Jiahui Li, Dandan Liu

Unmanned aerial vehicle (UAV) technology, with its flexible deployment, has enabled the development of Internet of Things (IoT) applications. However, it is difficult to guarantee the freshness of information delivery for an energy-limited UAV. Thus, we study trajectory design in a multiple-UAV communication system in which massive numbers of ground devices send their individual information to mobile UAV base stations under information-freshness requirements. First, an energy-efficiency (EE) maximization problem is formulated under remaining-energy, safety-distance, and age-of-information (AoI) constraints. However, the optimization problem is difficult to solve because of the nonconvex objective function and the unknown dynamic environment. Second, a trajectory design based on the deep Q-network method is proposed, in which a state space that accounts for energy efficiency, remaining energy, and AoI and a reward function tied to EE performance are constructed. Furthermore, to reduce dependencies in the neural network's training data, experience replay with random mini-batch sampling is adopted. Finally, we validate the system performance of the proposed scheme. Simulation results show that the proposed scheme achieves better EE performance than the benchmark scheme.
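As a hedged illustration of the experience-replay step the abstract mentions (not the authors' implementation), a minimal replay buffer with random mini-batch sampling can be written as follows; the state encoding is left abstract.

import random
from collections import deque

class ReplayBuffer:
    # Minimal experience replay: store transitions and sample random mini-batches
    # so consecutive (correlated) trajectory steps are not trained on in order.
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push(state=t, action=t % 4, reward=0.1 * t, next_state=t + 1, done=False)
print(len(buf.buffer), len(buf.sample(8)))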


2021
Author(s): Abhishek Gupta

In this thesis, we propose an environment perception framework for autonomous driving using deep reinforcement learning (DRL) that exhibits learning in autonomous vehicles under complex interactions with the environment, without being explicitly trained on driving datasets. Unlike existing techniques, our proposed technique takes the learning loss into account under deterministic as well as stochastic policy gradients. We apply DRL to object detection and safe navigation while enhancing a self-driving vehicle's ability to discern meaningful information from surrounding data. For efficient environment perception and object detection, various Q-learning based methods have been proposed in the literature. Unlike other works, this thesis proposes a collaborative deterministic and stochastic policy gradient based on DRL. Our technique is a combination of a variational autoencoder (VAE), deep deterministic policy gradient (DDPG), and soft actor-critic (SAC) that adequately trains a self-driving vehicle. In this work, we focus on uninterrupted and reasonably safe autonomous driving without colliding with an obstacle or steering off the track. We propose a collaborative framework that utilizes the best features of VAE, DDPG, and SAC and models autonomous driving as a partly stochastic, partly deterministic policy-gradient problem in a continuous action space and continuous state space. To ensure that the vehicle traverses the road over a considerable period of time, we employ a reward-penalty scheme in which a large negative penalty is associated with an unfavourable action and a comparatively smaller positive reward is awarded for favourable actions. We also examine the variations in policy loss, value loss, reward function, and cumulative reward for 'VAE+DDPG' and 'VAE+SAC' over the learning process.
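A hedged sketch of the asymmetric reward-penalty scheme the thesis describes (a larger penalty for unfavourable actions than the reward for favourable progress); the magnitudes, inputs, and function name below are placeholders rather than the thesis's exact definition.

def driving_reward(collided, off_track, progress_m,
                   penalty=-10.0, progress_reward_per_m=0.1):
    # Hypothetical asymmetric reward: large negative penalty for collisions or
    # leaving the track, comparatively small positive reward for forward progress.
    if collided or off_track:
        return penalty
    return progress_reward_per_m * progress_m

print(driving_reward(collided=False, off_track=False, progress_m=12.0),
      driving_reward(collided=True, off_track=False, progress_m=12.0))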

