Visual Navigation with Asynchronous Proximal Policy Optimization in Artificial Agents

2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Fanyu Zeng ◽  
Chen Wang

Vanilla policy gradient methods suffer from high variance, leading to unstable policies during training whose performance fluctuates drastically between iterations. To address this issue, we analyze the policy optimization process of a navigation method based on deep reinforcement learning (DRL) that uses asynchronous gradient descent for optimization. A navigation variant (asynchronous proximal policy optimization navigation, appoNav) is presented that can guarantee monotonic policy improvement during policy optimization. Our experiments are conducted in DeepMind Lab, and the results show that artificial agents trained with appoNav outperform the compared algorithm.
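For context, the monotonic-improvement behaviour that appoNav inherits from proximal policy optimization comes from a clipped surrogate objective. Below is a minimal Python sketch of that objective; the function and variable names (ppo_clip_loss, clip_eps) are illustrative assumptions, not code from the paper.

# Minimal sketch of the PPO clipped surrogate loss (illustrative names).
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped surrogate objective: updates are penalized once the
    probability ratio leaves [1 - clip_eps, 1 + clip_eps]."""
    ratio = np.exp(new_logp - old_logp)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO maximizes the surrogate, so the loss is its negation.
    return -np.mean(np.minimum(unclipped, clipped))

# Example: per-sample log-probabilities and advantages from a rollout.
new_logp = np.array([-1.1, -0.7, -2.0])
old_logp = np.array([-1.0, -0.9, -1.8])
adv = np.array([0.5, -0.2, 1.3])
print(ppo_clip_loss(new_logp, old_logp, adv))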

2021 ◽  
Author(s):  
Shicong Cen ◽  
Chen Cheng ◽  
Yuxin Chen ◽  
Yuting Wei ◽  
Yuejie Chi

Preconditioning and Regularization Enable Faster Reinforcement Learning

Natural policy gradient (NPG) methods, in conjunction with entropy regularization to encourage exploration, are among the most popular policy optimization algorithms in contemporary reinforcement learning. Despite the empirical success, the theoretical underpinnings for NPG methods remain severely limited. In “Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization”, Cen, Cheng, Chen, Wei, and Chi develop nonasymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on tabular discounted Markov decision processes. Assuming access to exact policy evaluation, the authors demonstrate that the algorithm converges linearly at an astonishing rate that is independent of the dimension of the state-action space. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Accommodating a wide range of learning rates, this convergence result highlights the role of preconditioning and regularization in enabling fast convergence.
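To make the update concrete, here is a minimal tabular Python sketch of one entropy-regularized NPG step under softmax parameterization, written in the multiplicative form that this line of analysis studies; the soft Q-values, toy problem sizes, and all names are illustrative assumptions rather than code from the paper.

# One entropy-regularized NPG step (illustrative sketch, exact soft Q assumed).
import numpy as np

def npg_entropy_step(pi, Q_tau, eta, tau, gamma):
    """pi: (S, A) current policy; Q_tau: (S, A) soft Q-values.
    Returns pi'(a|s) proportional to
    pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta * Q_tau(s,a) / (1-gamma))."""
    alpha = eta * tau / (1.0 - gamma)
    logits = (1.0 - alpha) * np.log(pi) + eta * Q_tau / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy example with 2 states and 3 actions.
pi = np.full((2, 3), 1.0 / 3.0)
Q_tau = np.array([[1.0, 0.5, 0.2], [0.1, 0.4, 0.9]])
print(npg_entropy_step(pi, Q_tau, eta=0.5, tau=0.1, gamma=0.9))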


2021 ◽  
Author(s):  
Srivatsan Krishnan ◽  
Behzad Boroujerdian ◽  
William Fu ◽  
Aleksandra Faust ◽  
Vijay Janapa Reddi

We introduce Air Learning, an open-source simulator and gym environment for deep reinforcement learning research on resource-constrained aerial robots. Equipped with domain randomization, Air Learning exposes a UAV agent to a diverse set of challenging scenarios. We seed the toolset with point-to-point obstacle avoidance tasks in three different environments and with Deep Q Network (DQN) and Proximal Policy Optimization (PPO) trainers. Air Learning assesses the policies’ performance under various quality-of-flight (QoF) metrics, such as the energy consumed, endurance, and the average trajectory length, on resource-constrained embedded platforms like a Raspberry Pi. We find that the trajectories on an embedded Raspberry Pi are vastly different from those predicted on a high-end desktop system, resulting in up to 40% longer trajectories in one of the environments. To understand the source of such discrepancies, we use Air Learning to artificially degrade high-end desktop performance to mimic what happens on a low-end embedded system. We then propose a mitigation technique that uses hardware-in-the-loop to determine the latency distribution of running the policy on the target platform (the onboard compute of the aerial robot). A latency randomly sampled from this distribution is then added as an artificial delay within the training loop. Training the policy with artificial delays allows us to minimize the hardware gap (the discrepancy in the flight-time metric is reduced from 37.73% to 0.5%). Thus, Air Learning with hardware-in-the-loop characterizes those differences and exposes how the choice of onboard compute affects the aerial robot’s performance. We also conduct reliability studies to assess the effect of sensor failures on the learned policies. All put together, Air Learning enables a broad class of deep RL research on UAVs. The source code is available at: https://github.com/harvard-edge/AirLearning.
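A minimal Python sketch of the latency-injection step described above, assuming a measured on-board latency distribution is available; the sample values, step time, and function names are illustrative placeholders rather than the toolset's actual API.

# Sample an artificial action delay from a measured latency distribution
# (all values below are hypothetical placeholders).
import random

measured_latencies_s = [0.02, 0.03, 0.05, 0.08]   # hypothetical HIL samples
sim_step_s = 0.01                                  # hypothetical simulator step

def sample_action_delay_steps():
    """Draw a latency from the empirical distribution and convert it into
    the number of simulator steps to hold the previous action."""
    latency = random.choice(measured_latencies_s)
    return max(1, round(latency / sim_step_s))

# Inside a training rollout, a freshly computed action would only take effect
# after sample_action_delay_steps() simulator steps, so the policy learns
# under the same compute lag it will experience on the embedded platform.
print(sample_action_delay_steps())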


Author(s):  
Zifei Jiang ◽  
Alan F. Lynch

We present a deep neural-net-based controller trained by a model-free reinforcement learning (RL) algorithm to achieve hover stabilization for a quadrotor unmanned aerial vehicle (UAV). With RL, two neural nets are trained. One neural net is used as a stochastic controller that outputs a distribution over control inputs. The other maps the UAV state to a scalar that estimates the expected reward of the controller. A proximal policy optimization (PPO) method, an actor-critic policy gradient approach, is used to train the neural nets. Simulation results show that the trained controller achieves a level of performance comparable to a manually tuned PID controller, despite not depending on any model information. The paper considers different choices of reward function and their influence on controller performance.
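A minimal PyTorch sketch of the two-network structure described above, one stochastic actor over control inputs and one scalar critic; the state and action dimensions, layer sizes, and class names are illustrative assumptions, not the paper's implementation.

# Actor-critic pair for hover control (illustrative dimensions and names).
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, state_dim=12, action_dim=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                  nn.Linear(64, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        # Returns a distribution over control inputs (e.g. rotor commands).
        mean = self.body(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    def __init__(self, state_dim=12):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                  nn.Linear(64, 1))

    def forward(self, state):
        return self.body(state).squeeze(-1)   # scalar value estimate

# Example: sample a control input and estimate the value of one UAV state.
state = torch.zeros(12)
action = GaussianActor()(state).sample()
value = Critic()(state)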


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 295 ◽  
Author(s):  
Xinpeng Wang ◽  
Chaozhong Wu ◽  
Jie Xue ◽  
Zhijun Chen

To date, automatic driving technology has become a research hotspot in academia, and it is necessary to personalize automatic driving decisions for each passenger. The purpose of this paper is to propose a self-learning method for personalized driving decisions. First, driving data from different drivers are collected and analyzed to set learning goals. Then, the Deep Deterministic Policy Gradient (DDPG) algorithm is used to design a driving decision system. Furthermore, personalized factors are introduced for some observed parameters to build a personalized driving decision model. Finally, the proposed method is compared with classic deep reinforcement learning algorithms. The results show that the personalized driving decision model performs better than the classic algorithms and behaves similarly to manual driving. Therefore, the proposed model can effectively learn the human-like personalized driving decisions of different drivers on structured roads. Based on this model, a smart car can accomplish personalized driving.
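One way to read the "personalized factors" step is as per-driver scaling of selected observed parameters before they enter the DDPG policy. The Python sketch below illustrates that reading; every name and value in it is a hypothetical assumption rather than the paper's design.

# Hypothetical per-driver factors applied to observed parameters.
import numpy as np

personal_factors = {"driver_A": {"headway": 1.2, "speed": 0.9},
                    "driver_B": {"headway": 0.8, "speed": 1.1}}

def personalize_observation(obs, driver_id):
    """obs = [headway_m, ego_speed_mps, relative_speed_mps] (illustrative).
    Rescales the observed parameters by the driver's personalized factors
    before they are fed to the DDPG actor."""
    f = personal_factors[driver_id]
    return np.array([obs[0] * f["headway"], obs[1] * f["speed"], obs[2]])

print(personalize_observation(np.array([30.0, 15.0, -2.0]), "driver_A"))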


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 135426-135442 ◽  
Author(s):  
Fanyu Zeng ◽  
Chen Wang ◽  
Shuzhi Sam Ge

2019 ◽  
Vol 5 (1) ◽  
Author(s):  
Xiao-Ming Zhang ◽  
Zezhu Wei ◽  
Raza Asad ◽  
Xu-Chen Yang ◽  
Xin Wang

Reinforcement learning has been widely used in many problems, including quantum control of qubits. However, such problems can, at the same time, be solved by traditional, non-machine-learning methods, such as stochastic gradient descent and Krotov algorithms, and it remains unclear which one is most suitable when the control has specific constraints. In this work, we perform a comparative study on the efficacy of three reinforcement learning algorithms: tabular Q-learning, deep Q-learning, and policy gradient, as well as two non-machine-learning methods: stochastic gradient descent and Krotov algorithms, in the problem of preparing a desired quantum state. We find that, overall, the deep Q-learning and policy gradient algorithms outperform the others when the problem is discretized, e.g., when the control is allowed to take only discrete values, and when the problem scales up. The reinforcement learning algorithms can also adaptively reduce the complexity of the control sequences, shortening the operation time and improving the fidelity. Our comparison provides insights into the suitability of reinforcement learning in quantum control problems.
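For reference, a minimal Python sketch of the tabular Q-learning update that serves as one of the compared baselines, with discretized control values as the action set; the environment encoding, sizes, and hyperparameters are illustrative assumptions, not the paper's setup.

# Tabular Q-learning with discrete control values (illustrative sketch).
import numpy as np

n_states, n_actions = 100, 5          # e.g. 5 discrete control amplitudes
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng()

def q_update(s, a, r, s_next):
    """Standard one-step Q-learning backup."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def select_action(s):
    """Epsilon-greedy choice over the discrete control values."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())

# One illustrative transition: the reward could be the fidelity gain of the
# prepared state after applying the chosen control value.
q_update(s=0, a=select_action(0), r=0.05, s_next=1)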


2009 ◽  
Vol 5 (12) ◽  
pp. e1000586 ◽  
Author(s):  
Eleni Vasilaki ◽  
Nicolas Frémaux ◽  
Robert Urbanczik ◽  
Walter Senn ◽  
Wulfram Gerstner

2020 ◽  
Vol 34 (04) ◽  
pp. 6770-6777
Author(s):  
Chuheng Zhang ◽  
Yuanqi Li ◽  
Jian Li

It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic, leading to an unstable training process. We show that such instability can happen even in a very simple environment. To address this issue, we propose a new method, called target distribution learning (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
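A minimal PyTorch sketch of the alternation described above: the policy parameters are fitted to a proposed target distribution by minimizing a KL divergence. The way the target is formed here is an illustrative placeholder, and all names and values are assumptions rather than the paper's implementation.

# Fit a Gaussian policy to a proposed target distribution via KL minimization.
import torch

# Current policy: a Gaussian whose mean and log-std are learnable parameters.
mean = torch.zeros(2, requires_grad=True)
log_std = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.Adam([mean, log_std], lr=0.05)

# Proposed target distribution (placeholder: mean nudged toward actions that
# looked favorable in the last batch).
target = torch.distributions.Normal(torch.tensor([0.3, -0.1]), torch.ones(2))

# Inner loop: train the policy to approach the target distribution.
for _ in range(100):
    policy = torch.distributions.Normal(mean, log_std.exp())
    loss = torch.distributions.kl_divergence(target, policy).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()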

