Reinforcement Learning in Sparse-Reward Environments with Hindsight Policy Gradients

Neural Computation ◽

10.1162/neco_a_01387 ◽

2021 ◽

pp. 1-56

Author(s):

Paulo Rauber ◽

Avinash Ummadisingu ◽

Filipe Mutz ◽

Jürgen Schmidhuber

Keyword(s):

Reinforcement Learning ◽

Broad Class ◽

Gradient Methods ◽

Learning Agents ◽

Learning Agent ◽

Policy Gradient ◽

Efficient Learning ◽

Desirable Behavior ◽

Selection Of

Abstract A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enabling sample efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this letter, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable in-crease in sample efficiency.

An Evaluation Methodology for Interactive Reinforcement Learning with Simulated Users

Biomimetics ◽

10.3390/biomimetics6010013 ◽

2021 ◽

Vol 6 (1) ◽

pp. 13

Author(s):

Adam Bignold ◽

Francisco Cruz ◽

Richard Dazeley ◽

Peter Vamplew ◽

Cameron Foale

Keyword(s):

Reinforcement Learning ◽

Information Source ◽

Human Interaction ◽

Evaluation Methodology ◽

External Information ◽

Preliminary Evaluation ◽

Learning Agents ◽

Learning Agent ◽

Knowledge Bias ◽

The Impact

Interactive reinforcement learning methods utilise an external information source to evaluate decisions and accelerate learning. Previous work has shown that human advice could significantly improve learning agents’ performance. When evaluating reinforcement learning algorithms, it is common to repeat experiments as parameters are altered or to gain a sufficient sample size. In this regard, to require human interaction every time an experiment is restarted is undesirable, particularly when the expense in doing so can be considerable. Additionally, reusing the same people for the experiment introduces bias, as they will learn the behaviour of the agent and the dynamics of the environment. This paper presents a methodology for evaluating interactive reinforcement learning agents by employing simulated users. Simulated users allow human knowledge, bias, and interaction to be simulated. The use of simulated users allows the development and testing of reinforcement learning agents, and can provide indicative results of agent performance under defined human constraints. While simulated users are no replacement for actual humans, they do offer an affordable and fast alternative for evaluative assisted agents. We introduce a method for performing a preliminary evaluation utilising simulated users to show how performance changes depending on the type of user assisting the agent. Moreover, we describe how human interaction may be simulated, and present an experiment illustrating the applicability of simulating users in evaluating agent performance when assisted by different types of trainers. Experimental results show that the use of this methodology allows for greater insight into the performance of interactive reinforcement learning agents when advised by different users. The use of simulated users with varying characteristics allows for evaluation of the impact of those characteristics on the behaviour of the learning agent.

Designing Internal Reward of Reinforcement Learning Agents in Multi-Step Dilemma Problem

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2013.p0926 ◽

2013 ◽

Vol 17 (6) ◽

pp. 926-931 ◽

Cited By ~ 5

Author(s):

Yoshihiro Ichikawa ◽

◽

Keiki Takadama

Keyword(s):

Reinforcement Learning ◽

Local Convergence ◽

Average Reward ◽

Q Value ◽

Learning Agents ◽

Learning Agent ◽

Simulation Results ◽

External Reward

This paper proposes the reinforcement learning agent that estimates internal rewards using external rewards in order to avoid conflict in multi-step dilemma problem. Intensive simulation results have revealed that the agent succeeds in avoiding local convergence and obtains a behavior policy for reaching a higher reward by updating the Q-value using the value that is subtracted the average reward from an external reward.

Contracts for Difference: A Reinforcement Learning Approach

Journal of Risk and Financial Management ◽

10.3390/jrfm13040078 ◽

2020 ◽

Vol 13 (4) ◽

pp. 78

Author(s):

Nico Zengeler ◽

Uwe Handmann

Keyword(s):

Reinforcement Learning ◽

Short Term Memory ◽

Learning Agents ◽

Learning Framework ◽

Learning Agent ◽

Markov Decision ◽

Economic Trends ◽

Model Size ◽

Contracts For Difference ◽

Partially Observable

We present a deep reinforcement learning framework for an automatic trading of contracts for difference (CfD) on indices at a high frequency. Our contribution proves that reinforcement learning agents with recurrent long short-term memory (LSTM) networks can learn from recent market history and outperform the market. Usually, these approaches depend on a low latency. In a real-world example, we show that an increased model size may compensate for a higher latency. As the noisy nature of economic trends complicates predictions, especially in speculative assets, our approach does not predict courses but instead uses a reinforcement learning agent to learn an overall lucrative trading policy. Therefore, we simulate a virtual market environment, based on historical trading data. Our environment provides a partially observable Markov decision process (POMDP) to reinforcement learners and allows the training of various strategies.

The Successful Ingredients of Policy Gradient Algorithms

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/338 ◽

2021 ◽

Author(s):

Sven Gronauer ◽

Martin Gottwald ◽

Klaus Diepold

Keyword(s):

Reinforcement Learning ◽

State Of The Art ◽

Gradient Methods ◽

Gradient Algorithms ◽

The Sublime ◽

Policy Gradient ◽

Art Performance ◽

Underlying Mechanisms

Despite the sublime success in recent years, the underlying mechanisms powering the advances of reinforcement learning are yet poorly understood. In this paper, we identify these mechanisms - which we call ingredients - in on-policy policy gradient methods and empirically determine their impact on the learning. To allow an equitable assessment, we conduct our experiments based on a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not need sophisticated algorithms but can also be accomplished by the combination of a few simple ingredients.

Restraining Bolts for Reinforcement Learning Agents

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i09.7114 ◽

2020 ◽

Vol 34 (09) ◽

pp. 13659-13662

Author(s):

Giuseppe De Giacomo ◽

Luca Iocchi ◽

Marco Favorito ◽

Fabio Patrizi

Keyword(s):

Reinforcement Learning ◽

Science Fiction ◽

Linear Time ◽

Learning Agents ◽

Low Level ◽

Learning Agent ◽

Time Logic ◽

The World ◽

High Level

In this work we have investigated the concept of “restraining bolt”, inspired by Science Fiction. We have two distinct sets of features extracted from the world, one by the agent and one by the authority imposing some restraining specifications on the behaviour of the agent (the “restraining bolt”). The two sets of features and, hence the model of the world attainable from them, are apparently unrelated since of interest to independent parties. However they both account for (aspects of) the same world. We have considered the case in which the agent is a reinforcement learning agent on a set of low-level (subsymbolic) features, while the restraining bolt is specified logically using linear time logic on finite traces f/f over a set of high-level symbolic features. We show formally, and illustrate with examples, that, under general circumstances, the agent can learn while shaping its goals to suitably conform (as much as possible) to the restraining bolt specifications.1

Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

PLoS Computational Biology ◽

10.1371/journal.pcbi.1000586 ◽

2009 ◽

Vol 5 (12) ◽

pp. e1000586 ◽

Cited By ~ 55

Author(s):

Eleni Vasilaki ◽

Nicolas Frémaux ◽

Robert Urbanczik ◽

Walter Senn ◽

Wulfram Gerstner

Keyword(s):

Reinforcement Learning ◽

Gradient Methods ◽

Action Space ◽

Continuous State ◽

Policy Gradient

Reinforcement Learning in Linear Quadratic Deep Structured Teams: Global Convergence of Policy Gradient Methods

2020 59th IEEE Conference on Decision and Control (CDC) ◽

10.1109/cdc42340.2020.9304234 ◽

2020 ◽

Author(s):

Vida Fathi ◽

Jalal Arabneydi ◽

Amir G. Aghdam

Keyword(s):

Reinforcement Learning ◽

Global Convergence ◽

Gradient Methods ◽

Linear Quadratic ◽

Policy Gradient

Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/821 ◽

2019 ◽

Author(s):

Muhammad Masood ◽

Finale Doshi-Velez

Keyword(s):

Reinforcement Learning ◽

Optimization Technique ◽

Gradient Methods ◽

Domain Expert ◽

Learning Methods ◽

Maximum Mean Discrepancy ◽

Optimal Policies ◽

Policy Gradient ◽

Gradient Based ◽

The Difference

Standard reinforcement learning methods aim to master one way of solving a task whereas there may exist multiple near-optimal policies. Being able to identify this collection of near-optimal policies can allow a domain expert to efficiently explore the space of reasonable solutions. Unfortunately, existing approaches that quantify uncertainty over policies are not ultimately relevant to finding policies with qualitatively distinct behaviors. In this work, we formalize the difference between policies as a difference between the distribution of trajectories induced by each policy, which encourages diversity with respect to both state visitation and action choices. We derive a gradient-based optimization technique that can be combined with existing policy gradient methods to now identify diverse collections of well-performing policies. We demonstrate our approach on benchmarks and a healthcare task.

Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization

Applied Sciences ◽

10.3390/app11031131 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1131

Author(s):

Liwei Hou ◽

Hengsheng Wang ◽

Haoran Zou ◽

Qun Wang

Keyword(s):

Reinforcement Learning ◽

Learning Process ◽

Gradient Methods ◽

Policy Network ◽

Practice Phase ◽

Network Learning ◽

Network Parameters ◽

Model Free ◽

Policy Gradient ◽

Skills Learning

Autonomous learning of robotic skills seems to be more natural and more practical than engineered skills, analogous to the learning process of human individuals. Policy gradient methods are a type of reinforcement learning technique which have great potential in solving robot skills learning problems. However, policy gradient methods require too many instances of robot online interaction with the environment in order to learn a good policy, which means lower efficiency of the learning process and a higher likelihood of damage to both the robot and the environment. In this paper, we propose a two-phase (imitation phase and practice phase) framework for efficient learning of robot walking skills, in which we pay more attention to the quality of skill learning and sample efficiency at the same time. The training starts with what we call the first stage or the imitation phase of learning, updating the parameters of the policy network in a supervised learning manner. The training set used in the policy network learning is composed of the experienced trajectories output by the iterative linear Gaussian controller. This paper also refers to these trajectories as near-optimal experiences. In the second stage, or the practice phase, the experiences for policy network learning are collected directly from online interactions, and the policy network parameters are updated with model-free reinforcement learning. The experiences from both stages are stored in the weighted replay buffer, and they are arranged in order according to the experience scoring algorithm proposed in this paper. The proposed framework is tested on a biped robot walking task in a MATLAB simulation environment. The results show that the sample efficiency of the proposed framework is much higher than ordinary policy gradient algorithms. The algorithm proposed in this paper achieved the highest cumulative reward, and the robot learned better walking skills autonomously. In addition, the weighted replay buffer method can be made as a general module for other model-free reinforcement learning algorithms. Our framework provides a new way to combine model-based reinforcement learning with model-free reinforcement learning to efficiently update the policy network parameters in the process of robot skills learning.

Accelerating the training of deep reinforcement learning in autonomous driving

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v10.i3.pp649-656 ◽

2021 ◽

Vol 10 (3) ◽

pp. 649

Author(s):

Emmanuel Ifeanyi Iroegbu ◽

Devaraj Madhavi

Keyword(s):

Reinforcement Learning ◽

Autonomous Vehicle ◽

Autonomous Driving ◽

High Dimensional ◽

Training Time ◽

Learning Agent ◽

Policy Gradient ◽

Low Dimensional ◽

Policy Optimization

Deep reinforcement learning has been successful in solving common autonomous driving tasks such as lane-keeping by simply using pixel data from the front view camera as input. However, raw pixel data contains a very high-dimensional observation that affects the learning quality of the agent due to the complexity imposed by a 'realistic' urban environment. Ergo, we investigate how compressing the raw pixel data from high-dimensional state to low-dimensional latent space offline using a variational autoencoder can significantly improve the training of a deep reinforcement learning agent. We evaluated our method on a simulated autonomous vehicle in car learning to act and compared our results with many baselines including deep deterministic policy gradient, proximal policy optimization, and soft actorcritic. The result shows that the method greatly accelerates the training time and there was a remarkable improvement in the quality of the deep reinforcement learning agent.