Safe option-critic: learning safety in the option-critic architecture

2021, Vol 36
Author(s): Arushi Jain, Khimya Khetarpal, Doina Precup

Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent’s decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions which allow an agent to use sub-policies with start and end conditions. We consider a behaviour safe if it avoids regions of the state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty on the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, a continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.
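
Below is a minimal tabular sketch of the trade-off this abstract describes, assuming the variance of the TD error is used as the uncertainty signal and a fixed weight `psi` balances it against return; the paper's actual algorithm is a policy-gradient method over options, so this is only illustrative.

```python
import numpy as np

# A tabular sketch (not the paper's exact method): penalize the TD target by a
# running estimate of the TD-error variance so that states with inconsistent
# (high-uncertainty) outcomes become less attractive.

n_states, n_actions = 25, 4
alpha, gamma = 0.1, 0.99
psi = 0.1                                  # assumed weight trading return against uncertainty

Q = np.zeros((n_states, n_actions))        # action values
td_var = np.zeros((n_states, n_actions))   # running TD-error variance estimate

def safe_update(s, a, r, s_next, beta=0.05):
    """One variance-penalized TD update for the transition (s, a, r, s_next)."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    td_var[s, a] += beta * (td_error ** 2 - td_var[s, a])   # behavioural-consistency proxy
    Q[s, a] += alpha * (td_error - psi * td_var[s, a])      # penalized value update
```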

Energies, 2019, Vol 12 (18), pp. 3461
Author(s): Guang Yang, Feng Zhang, Cheng Gong, Shiwen Zhang

Reinforcement learning has potential in the area of intelligent transportation due to its generality and real-time capability. The Q-learning algorithm, an early algorithm in this family, has its own merits for solving the train timetable rescheduling (TTR) problem, but it falls short in two respects: dimensional limits on the action space and a slow convergence rate. In this paper, a deep deterministic policy gradient (DDPG) algorithm is applied to solve the energy-aimed train timetable rescheduling (ETTR) problem. This reinforcement learning algorithm fulfills the real-time requirements of the ETTR problem and adapts to random disturbances. Unlike Q-learning, DDPG handles continuous state and action spaces. After sufficient training, the learning agent based on DDPG takes proper actions by continuously adjusting the cruising speed and the dwelling time for each train in a metro network when random disturbances happen. Although training requires thousands of episodes, the policy decision during each testing episode takes a very short time. Models for the metro network, based on a real case of the Shanghai Metro Line 1, are established as a training and testing environment. To validate the energy-saving effect and the real-time capability of the proposed algorithm, four experiments are designed and conducted. Compared with the no-action strategy, the results show that the proposed algorithm has real-time performance and saves a significant percentage of energy under random disturbances.
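
For reference, a generic single-step DDPG update in PyTorch is sketched below; the state/action dimensions, network sizes, and hyper-parameters are placeholders, and the mapping of actions to cruising-speed and dwell-time adjustments for the metro network is not reproduced here.

```python
import torch
import torch.nn as nn

# Generic DDPG actor-critic update (sketch, not the paper's configuration).
state_dim, action_dim = 16, 4
gamma, tau = 0.99, 0.005

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, actor_target = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One gradient step on a batch of transitions (all arguments are 2-D tensors)."""
    # Critic: regress Q(s, a) onto the bootstrapped target (actions squashed to [-1, 1]).
    with torch.no_grad():
        q_next = critic_target(torch.cat([s2, torch.tanh(actor_target(s2))], dim=1))
        target = r + gamma * (1 - done) * q_next
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, ascend Q(s, pi(s)).
    actor_loss = -critic(torch.cat([s, torch.tanh(actor(s))], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging of the target networks.
    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)
```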


2016, Vol 28 (3), pp. 563-593
Author(s): Yao Ma, Tingting Zhao, Kohei Hatano, Masashi Sugiyama

We consider the learning problem under an online Markov decision process (MDP), aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret—the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the proposed online policy gradient method through experiments.
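
A schematic projected online policy-gradient loop of the kind analysed in the letter is sketched below; the bounded parameter set and the 1/√t step size are illustrative assumptions consistent with the O(√T) regret regime, not the letter's exact algorithm.

```python
import numpy as np

# Schematic online policy-gradient loop: at each step t the environment may use
# a different reward function, and the learner takes one projected gradient step.

theta = np.zeros(8)     # policy parameters (continuous parameter space)
radius = 10.0           # assumed bound on the feasible parameter set

def project(theta, radius):
    """Euclidean projection onto the ball {theta : ||theta|| <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def online_pg(grad_estimates):
    """grad_estimates[t] is a gradient estimate of the step-t expected reward."""
    global theta
    for t, g in enumerate(grad_estimates, start=1):
        eta_t = 1.0 / np.sqrt(t)          # O(1/sqrt(t)) step size
        theta = project(theta + eta_t * g, radius)
    return theta
```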


2021, Vol 54 (5), pp. 1-35
Author(s): Shubham Pateria, Budhitama Subagdja, Ah-hwee Tan, Chai Quek

Hierarchical Reinforcement Learning (HRL) enables autonomous decomposition of challenging long-horizon decision-making tasks into simpler subtasks. Over the past years, the landscape of HRL research has grown profoundly, resulting in a wide variety of approaches. A comprehensive overview of this vast landscape is necessary to study HRL in an organized manner. We provide a survey of the diverse HRL approaches concerning the challenges of learning hierarchical policies, subtask discovery, transfer learning, and multi-agent learning using HRL. The survey is presented according to a novel taxonomy of the approaches. Based on the survey, a set of important open problems is proposed to motivate future research in HRL. Furthermore, we outline a few suitable task domains for evaluating the HRL approaches and a few interesting examples of the practical applications of HRL in the Supplementary Material.


Author(s): Feng Pan, Hong Bao

This paper proposes a new approach of using reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We draw on the idea of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into reward vectors, and the reward function was defined as the inner product of the reward vector and a weight vector. Driving data from human drivers was collected and analyzed to obtain the true reward function. The RL model was trained with the deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model could continuously approach that of a human driver. After dozens of rounds of training, we selected the policy whose value vector was nearest to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance for the task of an agent following the preceding vehicle safely and smoothly.
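
A sketch of the linear reward construction described above follows; the specific features (gap error, relative speed, acceleration) and the weight values are illustrative guesses, not the factors or weights identified from the collected driving data.

```python
import numpy as np

# Reward as the inner product of a feature (reward) vector and a weight vector.
# The weights are what get adjusted so the learned value vector approaches the
# human driver's; the values below are placeholders.

weights = np.array([1.0, 0.5, 0.1])

def reward(ego_speed, lead_speed, gap, desired_gap, accel):
    features = np.array([
        -abs(gap - desired_gap),       # keep a comfortable following distance
        -abs(lead_speed - ego_speed),  # match the preceding vehicle's speed
        -abs(accel),                   # penalize harsh acceleration and braking
    ])
    return float(weights @ features)
```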


2020, Vol 34 (04), pp. 3316-3323
Author(s): Qingpeng Cai, Ling Pan, Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
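
The sketch below illustrates how a model-based deterministic value gradient over a k-step imagined rollout might be blended with a model-free DDPG-style actor objective; the mixing weight `lam`, rollout length `k`, and the actor/critic/model interfaces (assumed PyTorch-style callables) are illustrative assumptions, not the paper's DVPG specification.

```python
# Conceptual sketch (PyTorch-style tensors assumed). `actor`, `critic`, and
# `model` are assumed callables; `model(s, a)` returning (reward, next_state)
# is an illustrative interface.

def dvpg_actor_loss(states, actor, critic, model, k=3, lam=0.5, gamma=0.99):
    """Scalar loss whose gradient mixes a k-step model rollout with DPG."""
    # Model-free term: ascend Q(s, pi(s)) as in DDPG.
    dpg_term = critic(states, actor(states)).mean()

    # Model-based term: differentiate the imagined k-step return through the
    # learned dynamics/reward model, bootstrapping with the critic at the end.
    s, ret, discount = states, 0.0, 1.0
    for _ in range(k):
        a = actor(s)
        r, s = model(s, a)
        ret = ret + discount * r.mean()
        discount *= gamma
    ret = ret + discount * critic(s, actor(s)).mean()

    # Negative sign because optimizers minimize; lam trades model bias vs. variance.
    return -(lam * ret + (1.0 - lam) * dpg_term)
```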


Author(s): Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron Courville, ...

End-to-end design of dialogue systems has recently become a popular research topic thanks to powerful tools such as encoder-decoder architectures for sequence-to-sequence learning. Yet, most current approaches cast human-machine dialogue management as a supervised learning problem, aiming at predicting the next utterance of a participant given the full history of the dialogue. This vision may fail to correctly render the planning problem inherent to dialogue as well as its contextual and grounded nature. In this paper, we introduce a Deep Reinforcement Learning method to optimize visually grounded task-oriented dialogues, based on the policy gradient algorithm. The approach is tested on the question generation task from the GuessWhat?! dataset, which contains 120k dialogues, and provides encouraging results both at generating natural dialogues and at discovering a specific object in a complex picture.
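
A minimal REINFORCE-style update for the question generator is sketched below; the `generator.sample` interface, the scalar task reward, and the baseline are illustrative assumptions rather than the paper's exact training setup.

```python
import torch

# Sketch of a policy-gradient (REINFORCE) update for a question generator:
# sample a dialogue, score it with the task reward (e.g. whether the target
# object was found), and reinforce the sampled tokens.

def reinforce_step(generator, optimizer, image_features, reward, baseline=0.0):
    """One policy-gradient update on a single sampled dialogue.

    `generator.sample(image_features)` is assumed to return the sampled token
    ids and their log-probabilities under the current policy.
    """
    tokens, log_probs = generator.sample(image_features)
    advantage = torch.as_tensor(reward - baseline, dtype=torch.float32)
    # Increase the log-probability of tokens from dialogues that earned reward.
    loss = -advantage * log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```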

