Diversity Evolutionary Policy Deep Reinforcement Learning

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Jian Liu ◽  
Liming Feng

Reinforcement learning algorithms based on the policy gradient may fall into local optima due to vanishing gradients during the update process, which in turn weakens the exploration ability of the reinforcement learning agent. To address this problem, this paper combines the cross-entropy method (CEM) from evolutionary policy search, the maximum mean discrepancy (MMD), and the twin delayed deep deterministic policy gradient algorithm (TD3) into a diversity evolutionary policy deep reinforcement learning (DEPRL) algorithm. Using the maximum mean discrepancy as a measure of the distance between different policies, some of the policies in the population maximize their distance from the previous generation of policies while maximizing the cumulative return during the gradient update. Furthermore, combining the cumulative return and the distance between policies into the population fitness encourages more diversity in the offspring policies, which in turn reduces the risk of falling into local optima due to vanishing gradients. Results in the MuJoCo test environment show that DEPRL achieves excellent performance on continuous control tasks; in the Ant-v2 environment in particular, the final return of DEPRL is nearly 20% higher than that of TD3.
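
As a rough illustration of the distance measure described above, the sketch below estimates the MMD between two policies from the actions they produce on a shared batch of states and folds it into a population fitness score. The RBF kernel, its bandwidth `sigma`, and the weighting `alpha` are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: MMD as a behavioural distance between two policies, estimated
# from their actions on a shared batch of states, combined with the cumulative
# return into a fitness score.  All constants here are illustrative assumptions.
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel between two sets of action vectors.
    diff = x[:, None, :] - y[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))

def mmd(actions_a, actions_b, sigma=1.0):
    # Biased estimate of the squared MMD between two action distributions.
    k_aa = rbf_kernel(actions_a, actions_a, sigma).mean()
    k_bb = rbf_kernel(actions_b, actions_b, sigma).mean()
    k_ab = rbf_kernel(actions_a, actions_b, sigma).mean()
    return k_aa + k_bb - 2.0 * k_ab

def fitness(cumulative_return, actions_new, actions_prev, alpha=0.1, sigma=1.0):
    # Fitness combines the cumulative return with the distance to the previous
    # generation's policy, encouraging diverse offspring.
    return cumulative_return + alpha * mmd(actions_new, actions_prev, sigma)

if __name__ == "__main__":
    states = np.random.randn(64, 8)                       # stand-in batch of states
    acts_new = np.tanh(states @ np.random.randn(8, 2))    # offspring policy actions
    acts_prev = np.tanh(states @ np.random.randn(8, 2))   # previous-generation actions
    print("fitness:", fitness(950.0, acts_new, acts_prev))
```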

2020 ◽  
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. The results demonstrate that DVPG substantially outperforms the other baselines.
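
A minimal sketch of the idea of mixing a model-based k-step deterministic value gradient with the model-free deterministic policy gradient, in the spirit of DVPG as summarized above. The network sizes, rollout length `k`, and mixing weight `w` are illustrative assumptions.

```python
# Hedged sketch: combine a k-step model-based value estimate (differentiating
# through a learned dynamics/reward model) with the model-free Q(s, pi(s))
# objective.  Shapes, k, w, and learning rates are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim, k, w, gamma = 8, 2, 3, 0.5, 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
dynamics = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, obs_dim))
reward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def model_based_value(s):
    # Roll the learned model forward k steps through the actor, then bootstrap
    # with the critic; gradients flow through the analytic model.
    ret, discount = 0.0, 1.0
    for _ in range(k):
        a = actor(s)
        sa = torch.cat([s, a], dim=-1)
        ret = ret + discount * reward_model(sa)
        s = dynamics(sa)
        discount *= gamma
    return ret + discount * critic(torch.cat([s, actor(s)], dim=-1))

def model_free_value(s):
    # Standard deterministic policy gradient objective: Q(s, pi(s)).
    return critic(torch.cat([s, actor(s)], dim=-1))

states = torch.randn(32, obs_dim)   # stand-in batch from a replay buffer
loss = -(w * model_based_value(states) + (1 - w) * model_free_value(states)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```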


Actuators ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 254
Author(s):  
Yangyang Hou ◽  
Huajie Hong ◽  
Dasheng Xu ◽  
Zhe Zeng ◽  
Yaping Chen ◽  
...  

Deep Reinforcement Learning (DRL) has been an active research area owing to its capability to solve large-scale control problems. To date, many algorithms have been developed, such as Deep Deterministic Policy Gradient (DDPG) and Twin-Delayed Deep Deterministic Policy Gradient (TD3). However, reaching convergence in DRL often requires extensive data collection and many training episodes, which is data inefficient and consumes considerable computing resources. Motivated by this problem, we propose a Twin-Delayed Deep Deterministic Policy Gradient algorithm with Rebirth, Tetanic Stimulation, and Amnesic Mechanisms (ATRTD3) for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using the tetanic stimulation and amnesia mechanisms. The main contribution of this paper is a biomimetic perspective that speeds up convergence by mimicking the biochemical reactions generated by neurons in the biological brain during memory and forgetting. The effectiveness of the proposed algorithm is validated by a simulation example that includes comparisons with previously developed DRL algorithms. The results indicate that our approach improves both convergence speed and precision.
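
The abstract describes the tetanic-stimulation and amnesia mechanisms only at a conceptual level, so the sketch below is one speculative reading: weights that receive a strong gradient signal are slightly reinforced, while rarely stimulated weights decay. The function name, thresholds, and factors are invented for illustration and are not taken from the paper.

```python
# Heavily hedged sketch of one possible reading of "tetanic stimulation" and
# "amnesia" applied to network weights after an ordinary optimizer step.  The
# boost/decay factors and threshold are invented for illustration only.
import torch
import torch.nn as nn

def tetanic_amnesia_step(module, boost=1.01, decay=0.999, grad_threshold=1e-3):
    with torch.no_grad():
        for p in module.parameters():
            if p.grad is None:
                continue
            active = p.grad.abs() > grad_threshold   # strongly stimulated weights
            p.data[active] *= boost                  # "tetanic" reinforcement
            p.data[~active] *= decay                 # gradual "amnesia"

actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())
loss = actor(torch.randn(16, 8)).pow(2).mean()       # stand-in loss
loss.backward()
tetanic_amnesia_step(actor)                          # applied after the usual update
```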


Author(s):  
Feng Pan ◽  
Hong Bao

This paper proposes a new approach that uses reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We draw on the idea of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following are vectorized into a reward vector, and the reward function is defined as the inner product of the reward vector and a weight vector. Driving data from human drivers were collected and analyzed to obtain the true reward function. Because the state and action spaces are continuous, the RL model was trained with the deterministic policy gradient algorithm. We adjusted the weight vector of the reward function so that the value vector of the RL model continuously approached that of a human driver. After dozens of rounds of training, we selected the policy whose value vector was nearest to that of a human driver and tested it in the PanoSim simulation environment. The results show the desired performance on the task of following the preceding vehicle safely and smoothly.
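
The reward construction described here is concrete enough for a small sketch: the factors weighed in car following are collected into a feature (reward) vector, and the reward is the inner product of that vector with a weight vector. The particular features and weights below are illustrative assumptions, not those extracted from the authors' driving data.

```python
# Hedged sketch: reward as the inner product of a reward (feature) vector and a
# weight vector.  The chosen features and weights are illustrative assumptions.
import numpy as np

def reward_features(gap, relative_speed, ego_accel, jerk):
    # Candidate factors a following agent might trade off.
    return np.array([
        -abs(gap - 30.0),      # deviation from a desired headway (m)
        -abs(relative_speed),  # speed difference to the lead vehicle (m/s)
        -ego_accel ** 2,       # comfort: penalize harsh acceleration
        -jerk ** 2,            # comfort: penalize jerk
    ])

def reward(weights, gap, relative_speed, ego_accel, jerk):
    return float(np.dot(weights, reward_features(gap, relative_speed, ego_accel, jerk)))

# The weight vector is what would be tuned so the learned value vector
# approaches that of a human driver.
w = np.array([0.4, 0.3, 0.2, 0.1])
print(reward(w, gap=25.0, relative_speed=1.2, ego_accel=0.5, jerk=0.1))
```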


Author(s):  
Shihui Li ◽  
Yi Wu ◽  
Xinyue Cui ◽  
Honghua Dong ◽  
Fei Fang ◽  
...  

Despite the recent advances of deep reinforcement learning (DRL), agents trained by DRL tend to be brittle and sensitive to the training environment, especially in multi-agent scenarios. In the multi-agent setting, a DRL agent's policy can easily get stuck in a poor local optimum with respect to its training partners: the learned policy may be only locally optimal against the other agents' current policies. In this paper, we focus on training robust DRL agents with continuous actions in the multi-agent learning setting so that the trained agents still generalize when their opponents' policies change. To tackle this problem, we propose a new algorithm, MiniMax Multi-agent Deep Deterministic Policy Gradient (M3DDPG), with the following contributions: (1) we introduce a minimax extension of the popular multi-agent deep deterministic policy gradient algorithm (MADDPG) for robust policy learning; (2) since the continuous action space makes our minimax learning objective computationally intractable, we propose Multi-Agent Adversarial Learning (MAAL) to efficiently solve the proposed formulation. We empirically evaluate M3DDPG in four mixed cooperative and competitive multi-agent environments, and the agents trained by our method significantly outperform existing baselines.
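
A minimal sketch of the adversarial-perturbation idea that MAAL uses to approximate the inner minimization: the other agents' continuous actions are nudged one gradient step in the direction that lowers the centralized Q value, and the agent then trains against these perturbed actions. The step size `eps` and the network shapes are illustrative assumptions.

```python
# Hedged sketch: approximate the minimax inner step by perturbing other agents'
# actions along the negative gradient of the centralized critic.  Dimensions
# and the step size eps are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents, eps = 6, 2, 3, 0.1
critic = nn.Sequential(nn.Linear(n_agents * (obs_dim + act_dim), 64),
                       nn.ReLU(), nn.Linear(64, 1))

def perturb_other_actions(obs_all, act_all, agent_idx):
    # Gradient of Q w.r.t. all actions, used to nudge the other agents'
    # actions toward lowering agent_idx's value.
    act_all = act_all.clone().requires_grad_(True)
    q = critic(torch.cat([obs_all.flatten(1), act_all.flatten(1)], dim=-1)).sum()
    grad = torch.autograd.grad(q, act_all)[0]
    perturbed = act_all.detach() - eps * grad
    perturbed[:, agent_idx] = act_all.detach()[:, agent_idx]   # keep own action fixed
    return perturbed

obs = torch.randn(32, n_agents, obs_dim)    # stand-in batch
acts = torch.randn(32, n_agents, act_dim)
worst_case_acts = perturb_other_actions(obs, acts, agent_idx=0)
```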


Symmetry ◽  
2019 ◽  
Vol 11 (11) ◽  
pp. 1352 ◽  
Author(s):  
Kim ◽  
Park

In deep reinforcement learning (RL), exploration is highly significant for achieving better generalization. In benchmark studies, ε-greedy random actions have been used to encourage exploration and prevent over-fitting, thereby improving generalization. Deep RL with random ε-greedy policies, such as deep Q-networks (DQNs), can demonstrate efficient exploration behavior. A random ε-greedy policy exploits additional replay buffers in an environment with sparse and binary rewards, such as the real-time online detection of network security by verifying whether the network is "normal or anomalous." Prior studies have shown that prioritized replay memory based on the temporal-difference error provides superior theoretical results. However, other work has shown that in certain environments prioritized replay memory is not superior to the randomly selected buffers of a random ε-greedy policy. Moreover, a key challenge of hindsight experience replay inspires our objective of using additional buffers corresponding to each different goal. We therefore exploit multiple random ε-greedy buffers to enhance exploration toward near-perfect generalization with one original goal in off-policy RL. We demonstrate the benefit of off-policy learning with our method through an experimental comparison of DQN and the deep deterministic policy gradient for discrete action as well as continuous control in completely symmetric environments.
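
A minimal sketch of the multiple-buffer idea: several replay buffers are filled under different ε-greedy exploration rates, and each training batch draws from all of them. The ε values, buffer sizes, and the even mixing rule are illustrative assumptions.

```python
# Hedged sketch: multiple epsilon-greedy replay buffers, with batches drawn
# across all of them.  The epsilons, capacities, and mixing are assumptions.
import random
from collections import deque

epsilons = [0.05, 0.1, 0.2, 0.4]
buffers = {eps: deque(maxlen=50_000) for eps in epsilons}

def act(q_values, eps):
    # Standard epsilon-greedy action selection over discrete Q-values.
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def store(eps, transition):
    buffers[eps].append(transition)          # (s, a, r, s_next, done)

def sample_batch(batch_size=64):
    # Draw roughly equal shares from each buffer so every exploration regime
    # contributes to each update.
    per_buffer = batch_size // len(buffers)
    batch = []
    for buf in buffers.values():
        if len(buf) >= per_buffer:
            batch.extend(random.sample(list(buf), per_buffer))
    return batch
```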


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 295 ◽  
Author(s):  
Xinpeng Wang ◽  
Chaozhong Wu ◽  
Jie Xue ◽  
Zhijun Chen

To date, automatic driving technology has become a research hotspot in academia, and it is necessary to personalize automatic driving decisions for each passenger. The purpose of this paper is to propose a self-learning method for personalized driving decisions. First, driving data from different drivers are collected and analyzed to set learning goals. Then, the Deep Deterministic Policy Gradient algorithm is used to design a driving decision system. Furthermore, personalized factors are introduced for some observed parameters to build a personalized driving decision model. Finally, the proposed method is compared with classic deep reinforcement learning algorithms. The results show that the personalized driving decision model performs better than the classic algorithms and behaves similarly to manual driving. Therefore, the proposed model can effectively learn the human-like personalized driving decisions of different drivers on structured roads. Based on this model, a smart car can accomplish personalized driving.
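
One simple way to read the "personalized factors" on observed parameters is as per-driver scaling of the observation before it reaches the policy network; the sketch below illustrates that reading. The observation layout, driver profiles, and factor values are invented for illustration.

```python
# Hedged sketch: per-driver "personalized factors" applied to the observed
# parameters before they reach a DDPG-style policy.  All names and numbers
# below are illustrative assumptions.
import numpy as np

BASE_OBS = ["headway_m", "ego_speed", "relative_speed", "lateral_offset"]

driver_profiles = {
    "cautious":   np.array([1.3, 0.9, 1.1, 1.0]),   # weighs headway more heavily
    "aggressive": np.array([0.8, 1.1, 1.0, 1.0]),   # tolerates smaller headways
}

def personalized_observation(raw_obs, driver):
    # Element-wise scaling of the observation by the driver's factors.
    return raw_obs * driver_profiles[driver]

raw = np.array([28.0, 16.5, -0.8, 0.1])
for name in driver_profiles:
    print(name, personalized_observation(raw, name))
```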


2020 ◽  
pp. 107754632093014
Author(s):  
Xue-She Wang ◽  
James D Turner ◽  
Brian P Mann

This study describes an approach for attractor selection (or multistability control) in nonlinear dynamical systems with constrained actuation. Attractor selection is obtained using two different deep reinforcement learning methods: (1) the cross-entropy method and (2) the deep deterministic policy gradient method. The framework and algorithms for applying these control methods are presented. Experiments were performed on a Duffing oscillator, as it is a classic nonlinear dynamical system with multiple attractors. Both methods achieve attractor selection under various control constraints. Although these methods have nearly identical success rates, the deep deterministic policy gradient method has the advantages of a high learning rate, low performance variance, and a smooth control approach. This study demonstrates the ability of two reinforcement learning approaches to achieve constrained attractor selection.
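
For context, here is a sketch of the kind of constrained-actuation Duffing environment such methods act on: the agent applies a bounded force u and is rewarded for settling near a target attractor. The oscillator coefficients, force limit, and reward shaping below are illustrative assumptions, not the study's settings.

```python
# Hedged sketch: a forced Duffing oscillator with a bounded control input and a
# reward that favours the neighbourhood of a target attractor.  Parameters are
# illustrative assumptions.
import numpy as np

DELTA, ALPHA, BETA, GAMMA, OMEGA = 0.1, -1.0, 1.0, 0.35, 1.4   # Duffing coefficients
U_MAX, DT = 0.2, 0.01                                          # actuation constraint, time step

def step(state, t, u):
    # x'' + delta*x' + alpha*x + beta*x^3 = gamma*cos(omega*t) + u
    x, v = state
    u = float(np.clip(u, -U_MAX, U_MAX))
    a = -DELTA * v - ALPHA * x - BETA * x ** 3 + GAMMA * np.cos(OMEGA * t) + u
    return np.array([x + v * DT, v + a * DT]), t + DT

def reward(state, target_x=1.0):
    # Closer to the target attractor's neighbourhood -> higher reward.
    return -abs(state[0] - target_x)

state, t = np.array([0.1, 0.0]), 0.0
for _ in range(1000):                  # roll out with zero control as a sanity check
    state, t = step(state, t, u=0.0)
print("final state:", state, "reward:", reward(state))
```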


2020 ◽  
Vol 17 (1) ◽  
pp. 172988141989834
Author(s):  
Guoyu Zuo ◽  
Qishen Zhao ◽  
Jiahao Lu ◽  
Jiangeng Li

The goal of reinforcement learning is to enable an agent to learn by using rewards. However, some robotic tasks are naturally specified with sparse rewards, and manually shaping reward functions is difficult. In this article, we propose a general, model-free reinforcement learning approach for robotic tasks with sparse rewards. First, a variant of Hindsight Experience Replay, Curious and Aggressive Hindsight Experience Replay, is proposed to improve the sample efficiency of reinforcement learning methods and avoid the need for complicated reward engineering. Second, based on the Twin Delayed Deep Deterministic policy gradient algorithm, demonstrations are leveraged to overcome the exploration problem and speed up policy training. Finally, an action loss is added to the loss function to minimize the vibration of the output action while maximizing the value of the action. Experiments on simulated robotic tasks are performed with different hyperparameters to verify the effectiveness of our method. The results show that our method can effectively solve the sparse reward problem and achieve a high learning speed.
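
A minimal sketch of adding an action penalty to a TD3-style actor loss so that the action's value is maximized while the output action stays small and smooth, as described above. The penalty weight `lam` and network sizes are illustrative assumptions.

```python
# Hedged sketch: actor loss that maximizes Q while penalizing large output
# actions ("action loss").  The weight lam and the network shapes are assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim, lam = 10, 4, 0.05
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

states = torch.randn(128, obs_dim)              # stand-in replay-buffer batch
actions = actor(states)
q_values = critic(torch.cat([states, actions], dim=-1))

# Actor loss: maximize Q while keeping the output action small to reduce vibration.
actor_loss = -q_values.mean() + lam * actions.pow(2).mean()
opt.zero_grad(); actor_loss.backward(); opt.step()
```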

