PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Shilei Li; Meng Li; Jiongming Su; Shaofei Chen; Zhimin Yuan; Qing Ye

doi:10.1145/3452008

ACM Transactions on Intelligent Systems and Technology ◽

10.1145/3452008 ◽

2021 ◽

Vol 12 (3) ◽

pp. 1-21

Author(s):

Shilei Li ◽

Meng Li ◽

Jiongming Su ◽

Shaofei Chen ◽

Zhimin Yuan ◽

...

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Gradient Methods ◽

Action Space ◽

Fine Tuning ◽

Continuous Control ◽

Parametric Perturbation ◽

Gradient Information ◽

Policy Gradient ◽

Gradient Based

Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) operating in high-dimensional action and state spaces. Recently, a promising approach has been proposed that combines exploration in the action space with exploration in the parameter space to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines the evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with the actor-critic deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parametric perturbation part) evolve in a guided manner by utilizing the gradient information provided by the DDPG, while the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve the sample efficiency. In particular, we propose a criterion to determine the number of training steps required for the DDPG, ensuring that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a more balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the same previous RL-Actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve the efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
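
The loop described in this abstract (a population explored by parameter perturbation, with gradient-based fine-tuning applied only to the best individual) can be illustrated with a toy sketch. This is not the PP-PG algorithm: the quadratic `fitness` stands in for an episode return, `gradient_step` stands in for DDPG fine-tuning, and all names and hyperparameters (`evolve`, `tune_steps`, `sigma`, and so on) are hypothetical choices made for illustration.

```python
import random

def fitness(theta, target):
    # Toy stand-in for an episode return: higher is better, maximum 0 at theta == target.
    return -sum((t - g) ** 2 for t, g in zip(theta, target))

def gradient_step(theta, target, lr=0.1):
    # Analytic gradient ascent on the toy fitness; a placeholder for gradient-based fine-tuning.
    return [t + lr * (-2.0 * (t - g)) for t, g in zip(theta, target)]

def perturb(theta, sigma, rng):
    # Gaussian noise added directly to the parameters (parameter-space exploration).
    return [t + rng.gauss(0.0, sigma) for t in theta]

def evolve(target, pop_size=8, generations=30, sigma=0.1, tune_steps=3, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in target] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Gradient-free part: rank the population by fitness.
        population.sort(key=lambda th: fitness(th, target), reverse=True)
        best = population[0]
        best_history.append(fitness(best, target))
        # Gradient-based part: fine-tune only the best individual.
        tuned = best
        for _ in range(tune_steps):
            tuned = gradient_step(tuned, target)
        # Next generation: keep the elite and its tuned copy, fill the rest with perturbed copies.
        population = [best, tuned] + [perturb(tuned, sigma, rng) for _ in range(pop_size - 2)]
    population.sort(key=lambda th: fitness(th, target), reverse=True)
    return population[0], best_history

if __name__ == "__main__":
    best, history = evolve([0.5, -0.3, 0.8])
    print(best, history[-1])
```

Because the elite individual is carried over unchanged, the best fitness recorded per generation never decreases in this sketch; the criterion for the number of fine-tuning steps and the actor-switching rule mentioned in the abstract are not modeled here.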

Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

PLoS Computational Biology ◽

10.1371/journal.pcbi.1000586 ◽

2009 ◽

Vol 5 (12) ◽

pp. e1000586 ◽

Cited By ~ 55

Author(s):

Eleni Vasilaki ◽

Nicolas Frémaux ◽

Robert Urbanczik ◽

Walter Senn ◽

Wulfram Gerstner

Keyword(s):

Reinforcement Learning ◽

Gradient Methods ◽

Action Space ◽

Continuous State ◽

Policy Gradient

Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/821 ◽

2019 ◽

Author(s):

Muhammad Masood ◽

Finale Doshi-Velez

Keyword(s):

Reinforcement Learning ◽

Optimization Technique ◽

Gradient Methods ◽

Domain Expert ◽

Learning Methods ◽

Maximum Mean Discrepancy ◽

Optimal Policies ◽

Policy Gradient ◽

Gradient Based ◽

The Difference

Standard reinforcement learning methods aim to master one way of solving a task, whereas multiple near-optimal policies may exist. Being able to identify this collection of near-optimal policies can allow a domain expert to efficiently explore the space of reasonable solutions. Unfortunately, existing approaches that quantify uncertainty over policies are not ultimately relevant to finding policies with qualitatively distinct behaviors. In this work, we formalize the difference between policies as a difference between the distributions of trajectories induced by each policy, which encourages diversity with respect to both state visitation and action choices. We derive a gradient-based optimization technique that can be combined with existing policy gradient methods to identify diverse collections of well-performing policies. We demonstrate our approach on benchmarks and a healthcare task.
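
As a rough illustration of measuring the difference between two trajectory distributions, the following sketch computes the biased (V-statistic) estimate of the squared maximum mean discrepancy with a Gaussian kernel. It assumes each trajectory has already been summarized as a fixed-length feature vector; that featurization, the kernel choice, and the names `rbf` and `mmd_squared` are assumptions for illustration, not details taken from the paper.

```python
import math

def rbf(x, y, bandwidth=1.0):
    # Gaussian (RBF) kernel between two feature vectors of equal length.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    # Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    def mean_kernel(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

if __name__ == "__main__":
    # Hypothetical trajectory features for two policies.
    policy_a = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.05]]
    policy_b = [[2.0, 2.1], [1.9, 2.0], [2.1, 1.95]]
    print(mmd_squared(policy_a, policy_a))  # identical samples
    print(mmd_squared(policy_a, policy_b))  # well-separated samples
```

The estimate is zero (up to rounding) when both sample sets are identical and grows as the sets separate, which is the property a diversity term would rely on.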

Automated Design of Energy Efficient Control Strategies for Building Clusters Using Reinforcement Learning

Journal of Mechanical Design ◽

10.1115/1.4041629 ◽

2018 ◽

Vol 141 (2) ◽

Cited By ~ 2

Author(s):

Philip Odonkor ◽

Kemper Lewis

Keyword(s):

Reinforcement Learning ◽

Control Strategies ◽

Action Space ◽

Continuous Control ◽

Energy Demands ◽

Current State ◽

Decision Cycle ◽

Policy Gradient ◽

Discrete Action ◽

Energy Assets

The control of shared energy assets within building clusters has traditionally been confined to a discrete action space, owing in part to a computationally intractable decision space. In this work, we leverage the current state of the art in reinforcement learning (RL) for continuous control tasks, the deep deterministic policy gradient (DDPG) algorithm, to address this limitation. The goals of this paper are twofold: (i) to design an efficient charge/discharge dispatch policy for a shared battery system within a building cluster and (ii) to address the continuous-domain task of determining how much energy should be charged or discharged at each decision cycle. Experimentally, our results demonstrate an ability to exploit factors such as energy arbitrage, along with the continuous action space, to minimize demand peaks. This approach is shown to be computationally tractable, achieving efficient results after only 5 h of simulation. Additionally, the agent showed an ability to adapt to different building clusters, designing unique control strategies to address the energy demands of the clusters studied.

A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator

Actuators ◽

10.3390/act10100254 ◽

2021 ◽

Vol 10 (10) ◽

pp. 254

Author(s):

Yangyang Hou ◽

Huajie Hong ◽

Dasheng Xu ◽

Zhe Zeng ◽

Yaping Chen ◽

...

Keyword(s):

Reinforcement Learning ◽

Large Scale ◽

Learning Algorithm ◽

Research Area ◽

Gradient Algorithm ◽

Data Sets ◽

Continuous Control ◽

Tetanic Stimulation ◽

Policy Gradient ◽

Active Research

Deep Reinforcement Learning (DRL) has been an active research area because of its ability to solve large-scale control problems. To date, many algorithms have been developed, such as Deep Deterministic Policy Gradient (DDPG) and Twin-Delayed Deep Deterministic Policy Gradient (TD3). However, convergence in DRL often requires extensive collected data sets and many training episodes, which is data inefficient and consumes considerable computing resources. Motivated by this problem, in this paper we propose a Twin-Delayed Deep Deterministic Policy Gradient algorithm with a Rebirth Mechanism, Tetanic Stimulation and Amnesic Mechanisms (ATRTD3), for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using the Tetanic stimulation and Amnesia mechanisms. The main contribution of this paper is a biomimetic view of speeding up convergence, inspired by the biochemical reactions generated by neurons in the biological brain during memory and forgetting. The effectiveness of the proposed algorithm is validated by a simulation example that includes comparisons with previously developed DRL algorithms. The results indicate that our approach improves both convergence speed and precision.

Correction: Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

PLoS Computational Biology ◽

10.1371/annotation/307ea250-3792-4ceb-b905-162d86c96baf ◽

2009 ◽

Vol 5 (12) ◽

Cited By ~ 2

Author(s):

Eleni Vasilaki ◽

Nicolas Frémaux ◽

Robert Urbanczik ◽

Walter Senn ◽

Wulfram Gerstner

Keyword(s):

Reinforcement Learning ◽

Gradient Methods ◽

Action Space ◽

Continuous State ◽

Policy Gradient

A Gradient-Based Reinforcement Learning Algorithm for Multiple Cooperative Agents

IEEE Access ◽

10.1109/access.2018.2878853 ◽

2018 ◽

Vol 6 ◽

pp. 70223-70235 ◽

Cited By ~ 2

Author(s):

Zhen Zhang ◽

Dongqing Wang ◽

Dongbin Zhao ◽

Qiaoni Han ◽

Tingting Song

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Cooperative Agents ◽

Gradient Based ◽

Reinforcement Learning Algorithm

Deterministic Value-Policy Gradients

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5732 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3316-3323

Author(s):

Qingpeng Cai ◽

Ling Pan ◽

Pingzhong Tang

Keyword(s):

Reinforcement Learning ◽

State Of The Art ◽

Learning Algorithms ◽

Infinite Horizon ◽

Gradient Algorithm ◽

Continuous Control ◽

Model Bias ◽

Model Free ◽

Policy Gradient ◽

Analytical Gradients

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with an infinite horizon. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this theoretical guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different numbers of rollout steps of the analytical gradients through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.

Inverse design of grating couplers using the policy gradient method from reinforcement learning

Nanophotonics ◽

10.1515/nanoph-2021-0332 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Sean Hooten ◽

Raymond G. Beausoleil ◽

Thomas Van Vaerenbergh

Keyword(s):

Neural Network ◽

Reinforcement Learning ◽

Gradient Method ◽

Photonic Devices ◽

Inverse Design ◽

Grating Couplers ◽

Electromagnetic Devices ◽

Policy Gradient ◽

Gradient Based ◽

Local Gradient

We present a proof-of-concept technique for the inverse design of electromagnetic devices motivated by the policy gradient method in reinforcement learning, named PHORCED (PHotonic Optimization using REINFORCE Criteria for Enhanced Design). This technique uses a probabilistic generative neural network interfaced with an electromagnetic solver to assist in the design of photonic devices, such as grating couplers. We show that PHORCED obtains better performing grating coupler designs than local gradient-based inverse design via the adjoint method, while potentially providing faster convergence over competing state-of-the-art generative methods. As a further example of the benefits of this method, we implement transfer learning with PHORCED, demonstrating that a neural network trained to optimize 8° grating couplers can then be re-trained on grating couplers with alternate scattering angles while requiring >10× fewer simulations than control cases.
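
The REINFORCE idea that motivates this technique can be sketched on a one-dimensional toy problem: sample candidate designs from a parameterized distribution, score them with a black-box objective, and move the distribution's parameters along the score-function gradient estimate. The sketch below is not PHORCED. It replaces the generative neural network with a single Gaussian mean and the electromagnetic solver with a hypothetical quadratic `objective`, and every name and hyperparameter in it is an assumption.

```python
import random

def objective(x):
    # Hypothetical black-box figure of merit, maximized at x = 2.
    # In a real inverse-design setting this would be a solver call.
    return -(x - 2.0) ** 2

def reinforce_optimize(mu=-1.0, sigma=0.5, lr=0.05, batch=64, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # Sample candidate designs from the current distribution N(mu, sigma^2).
        samples = [rng.gauss(mu, sigma) for _ in range(batch)]
        rewards = [objective(x) for x in samples]
        # Batch-mean baseline to reduce the variance of the estimate.
        baseline = sum(rewards) / batch
        # The score function of N(mu, sigma^2) with respect to mu is (x - mu) / sigma^2.
        grad = sum((r - baseline) * (x - mu) / sigma ** 2
                   for x, r in zip(samples, rewards)) / batch
        mu += lr * grad  # gradient ascent on the expected reward
    return mu

if __name__ == "__main__":
    print(reinforce_optimize())
```

The update never differentiates `objective` itself, which is why this family of estimators can wrap a solver that exposes no gradients.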

Policy Gradient-based Integral Reinforcement Learning for Optimal Control Design of Nonaffine Morphing Aircraft Systems

2020 28th Mediterranean Conference on Control and Automation (MED) ◽

10.1109/med48518.2020.9183024 ◽

2020 ◽

Author(s):

Hanna Lee ◽

Seong-Hun Kim ◽

Youdan Kim

Keyword(s):

Optimal Control ◽

Reinforcement Learning ◽

Control Design ◽

Morphing Aircraft ◽

Aircraft Systems ◽

Policy Gradient ◽

Gradient Based

Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/475 ◽

2019 ◽

Cited By ~ 1

Author(s):

Wenjie Shi ◽

Shiji Song ◽

Cheng Wu

Keyword(s):

Reinforcement Learning ◽

Maximum Entropy ◽

Bellman Equation ◽

Value Functions ◽

Policy Actor ◽

Model Free ◽

Policy Gradient ◽

Gradient Based ◽

Continuous Actions ◽

Stable Learning

Maximum entropy deep reinforcement learning (RL) methods have been demonstrated on a range of challenging continuous tasks. However, existing methods either suffer from severe instability when training on large off-policy data or cannot scale to tasks with very high state and action dimensionality, such as 3D humanoid locomotion. In addition, the optimality of the desired Boltzmann policy defined for a non-optimal soft value function is not sufficiently justified. In this paper, we first derive the soft policy gradient based on an entropy-regularized expected reward objective for RL with continuous actions. Then, we present an off-policy, actor-critic, model-free maximum entropy deep RL algorithm called deep soft policy gradient (DSPG) by combining the soft policy gradient with the soft Bellman equation. To ensure stable learning while eliminating the need for two separate critics for soft value functions, we leverage a double sampling approach to make the soft Bellman equation tractable. The experimental results demonstrate that our method outperforms prior off-policy methods.
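
For readers unfamiliar with the terms "soft value function" and "Boltzmann policy", the following sketch shows their standard maximum entropy definitions for a single state with a finite set of actions. The abstract concerns continuous actions, so this discrete version is only a simplified illustration; the temperature `alpha` and the function names are assumptions.

```python
import math

def soft_value(q_values, alpha=1.0):
    # V = alpha * log(sum_a exp(Q(a) / alpha)), computed with the max subtracted for stability.
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def boltzmann_policy(q_values, alpha=1.0):
    # pi(a) = exp((Q(a) - V) / alpha); the probabilities sum to one by construction.
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    print(soft_value(q), boltzmann_policy(q))
```

As `alpha` shrinks, the soft value approaches the ordinary maximum over actions and the policy concentrates on the greedy action.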
