Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies

Author(s):  
Muhammad Masood ◽  
Finale Doshi-Velez

Standard reinforcement learning methods aim to master one way of solving a task, whereas there may exist multiple near-optimal policies. Being able to identify this collection of near-optimal policies can allow a domain expert to efficiently explore the space of reasonable solutions. Unfortunately, existing approaches that quantify uncertainty over policies are not ultimately relevant to finding policies with qualitatively distinct behaviors. In this work, we formalize the difference between policies as a difference between the distributions of trajectories induced by each policy, which encourages diversity with respect to both state visitation and action choices. We derive a gradient-based optimization technique that can be combined with existing policy gradient methods to identify diverse collections of well-performing policies. We demonstrate our approach on benchmarks and a healthcare task.
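
The abstract does not spell out the trajectory-level distance itself; below is a minimal sketch of how an empirical maximum mean discrepancy between two policies' sampled trajectories could be computed, assuming an RBF kernel on flattened state-action features. The kernel and feature construction are illustrative choices, not the paper's exact formulation.

```python
# Hypothetical sketch: biased empirical MMD^2 between two policies' trajectory
# samples, using an RBF kernel on flattened state-action features. The kernel
# and feature choice are illustrative assumptions, not the paper's formulation.
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Pairwise squared distances between rows of A and B, mapped through an RBF.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    # Biased empirical estimate of MMD^2(P, Q) from samples X ~ P, Y ~ Q.
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

# Example: two batches of trajectories, each flattened to a 20-dim feature vector.
rng = np.random.default_rng(0)
traj_policy_a = rng.normal(0.0, 1.0, size=(64, 20))
traj_policy_b = rng.normal(0.5, 1.0, size=(64, 20))
print(mmd2(traj_policy_a, traj_policy_b))
```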

2021 ◽  
Vol 12 (3) ◽  
pp. 1-21
Author(s):  
Shilei Li ◽  
Meng Li ◽  
Jiongming Su ◽  
Shaofei Chen ◽  
Zhimin Yuan ◽  
...  

Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) in high-dimensional action and state spaces. Recently, a promising line of work has combined exploration in the action space with exploration in the parameter space to get the best of both. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores gradient-free directly in the parameter space, with an actor-critic reinforcement learning algorithm, deep deterministic policy gradient (DDPG), which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parameter-perturbation part) evolve in a guided manner by exploiting the gradient information provided by DDPG, while the policy-gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population, improving sample efficiency. In particular, we propose a criterion for determining the number of training steps required by DDPG, which ensures that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a more balanced way within each generation. Furthermore, within the DDPG part, the algorithm can flexibly switch between fine-tuning the previous RL actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
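
As an illustration of the closed-loop structure described above (an EA population evolved in the parameter space, with gradient-based fine-tuning reserved for the best individual), here is a minimal, self-contained sketch. The environment rollouts, replay buffer, and the actual DDPG actor-critic update are replaced by a toy fitness function and plain gradient ascent, so this shows only the shape of the loop, not the authors' algorithm.

```python
# Hypothetical sketch of a closed-loop EA + gradient fine-tuning structure.
# The DDPG actor-critic and real environment rollouts are replaced by a toy
# fitness and analytic gradient ascent to keep the example self-contained.
import numpy as np

rng = np.random.default_rng(0)
DIM, POP, GENERATIONS, ELITES = 8, 16, 30, 4

def fitness(theta):
    # Stand-in for the episodic return of the policy with parameters theta.
    return -np.sum((theta - 1.0) ** 2)

def gradient_finetune(theta, steps=20, lr=0.05):
    # Stand-in for DDPG fine-tuning of the best individual: follow the
    # analytic gradient of the toy fitness.
    for _ in range(steps):
        theta = theta + lr * (-2.0 * (theta - 1.0))
    return theta

population = rng.normal(size=(POP, DIM))
for gen in range(GENERATIONS):
    scores = np.array([fitness(ind) for ind in population])
    elites = population[np.argsort(scores)[-ELITES:]]

    # Policy-gradient part: fine-tune only the best individual (sample-efficient).
    best = gradient_finetune(elites[-1].copy())

    # Guided evolution: build the next generation around the elites and the
    # gradient-improved individual, plus Gaussian parameter perturbations.
    parents = np.vstack([elites, best])
    children = parents[rng.integers(len(parents), size=POP - 1)]
    population = np.vstack([best, children + 0.1 * rng.normal(size=(POP - 1, DIM))])

print("best fitness:", max(fitness(ind) for ind in population))
```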


Nanophotonics ◽  
2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Sean Hooten ◽  
Raymond G. Beausoleil ◽  
Thomas Van Vaerenbergh

We present a proof-of-concept technique for the inverse design of electromagnetic devices motivated by the policy gradient method in reinforcement learning, named PHORCED (PHotonic Optimization using REINFORCE Criteria for Enhanced Design). This technique uses a probabilistic generative neural network interfaced with an electromagnetic solver to assist in the design of photonic devices, such as grating couplers. We show that PHORCED obtains better performing grating coupler designs than local gradient-based inverse design via the adjoint method, while potentially providing faster convergence over competing state-of-the-art generative methods. As a further example of the benefits of this method, we implement transfer learning with PHORCED, demonstrating that a neural network trained to optimize 8° grating couplers can then be re-trained on grating couplers with alternate scattering angles while requiring >10× fewer simulations than control cases.
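
A minimal sketch of the policy-gradient (REINFORCE) idea behind this kind of inverse design is shown below: a probability distribution over design parameters is updated from rewards returned by a black-box figure of merit. The Gaussian design distribution and toy objective here stand in for the paper's generative neural network and electromagnetic solver, so treat this as the score-function estimator only, not PHORCED itself.

```python
# Hypothetical sketch of REINFORCE-style optimization of a design distribution.
# A Gaussian over design parameters and a toy figure of merit replace the
# paper's generative network and EM solver.
import numpy as np

rng = np.random.default_rng(0)
mu, log_sigma = np.zeros(4), np.zeros(4)   # parameters of the design distribution

def figure_of_merit(design):
    # Stand-in for a black-box EM simulation (e.g., grating-coupler efficiency).
    return float(np.exp(-np.sum((design - 0.5) ** 2)))

for it in range(200):
    sigma = np.exp(log_sigma)
    designs = mu + sigma * rng.normal(size=(32, 4))          # sample candidate designs
    rewards = np.array([figure_of_merit(d) for d in designs])
    baseline = rewards.mean()                                 # variance-reduction baseline

    # Score-function (REINFORCE) gradients of E[reward] w.r.t. mu and log_sigma.
    z = (designs - mu) / sigma
    grad_mu = np.mean((rewards - baseline)[:, None] * z / sigma, axis=0)
    grad_log_sigma = np.mean((rewards - baseline)[:, None] * (z**2 - 1.0), axis=0)

    mu += 0.05 * grad_mu
    log_sigma += 0.05 * grad_log_sigma

print("optimized mean design:", mu.round(3))
```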


Author(s):  
Wenjie Shi ◽  
Shiji Song ◽  
Cheng Wu

Maximum entropy deep reinforcement learning (RL) methods have been demonstrated on a range of challenging continuous tasks. However, existing methods either suffer from severe instability when training on large amounts of off-policy data or cannot scale to tasks with very high state and action dimensionality, such as 3D humanoid locomotion. Moreover, the optimality of the Boltzmann policy induced by a non-optimal soft value function has not been convincingly established. In this paper, we first derive the soft policy gradient from an entropy-regularized expected-reward objective for RL with continuous actions. We then present an off-policy, actor-critic, model-free maximum entropy deep RL algorithm called deep soft policy gradient (DSPG), which combines the soft policy gradient with the soft Bellman equation. To ensure stable learning while eliminating the need for two separate critics for the soft value functions, we use a double-sampling approach that makes the soft Bellman equation tractable. Experimental results demonstrate that our method outperforms prior off-policy methods.
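
For reference, the entropy-regularized objective that "soft" policy gradient methods of this kind maximize, together with the soft Bellman equation, is usually written as below. The notation (discount γ, entropy temperature α) is the standard maximum-entropy RL form and is assumed here rather than taken from the paper.

```latex
% Standard maximum-entropy RL objective and soft Bellman equation
% (notation assumed, not necessarily the paper's exact form).
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\, \sum_{t} \gamma^{t}\Big( r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
Q^{\pi}(s,a) \;=\; r(s,a) \;+\; \gamma\, \mathbb{E}_{s'}\!\Big[\, \mathbb{E}_{a' \sim \pi}\big[\, Q^{\pi}(s',a') \;-\; \alpha \log \pi(a' \mid s') \,\big] \Big].
```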


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 37
Author(s):  
Hai-Tao Yu ◽  
Degen Huang ◽  
Fuji Ren ◽  
Lishuang Li

Learning-to-rank has been intensively studied and has shown significant value in a wide range of domains, such as web search, recommender systems, dialogue systems, machine translation, and even computational biology, to name a few. In light of recent advances in neural networks, there has been strong and continuing interest in exploring how to deploy popular techniques, such as reinforcement learning and adversarial learning, to solve ranking problems. However, most studies armed with these techniques focus on showing how effective a new method is; a comprehensive comparison between techniques and an in-depth analysis of their deficiencies are often overlooked. This paper is motivated by the observation that recent ranking methods based on either reinforcement learning or adversarial learning boil down to policy-gradient-based optimization. Based on widely used benchmark collections with complete information (where relevance labels are known for all items), such as MSLR-WEB30K and Yahoo-Set1, we thoroughly investigate the extent to which policy-gradient-based ranking methods are effective. On the one hand, we analytically identify the pitfalls of policy-gradient-based ranking. On the other hand, we experimentally compare a wide range of representative methods. The experimental results echo our analysis and show that policy-gradient-based ranking methods are, by a large margin, inferior to many conventional ranking methods. Regardless of whether reinforcement learning or adversarial learning is used, the failures are largely attributable to gradient estimation based on sampled rankings, which diverge significantly from the ideal rankings. In particular, the larger the number of documents per query and the more fine-grained the ground-truth labels, the more severely policy-gradient-based ranking suffers. Careful examination of this weakness is highly recommended when developing enhanced methods based on policy gradients.
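
The gradient-estimation step the abstract criticizes can be made concrete with a small sketch: a ranking is sampled from a Plackett-Luce model over per-document scores, evaluated with NDCG, and the scores are updated with a REINFORCE estimate. The toy query, labels, and score vector are assumptions for illustration, not the benchmark setup used in the paper.

```python
# Hypothetical sketch of policy-gradient-based ranking: Plackett-Luce sampling
# of a permutation, NDCG reward, and a REINFORCE update of per-document scores.
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([3, 2, 2, 1, 0, 0, 0, 0], dtype=float)  # graded relevance, one query
scores = np.zeros_like(labels)                             # per-document scores (the "policy")

def dcg(order):
    gains = 2.0 ** labels[order] - 1.0
    return float(np.sum(gains / np.log2(np.arange(2, len(order) + 2))))

ideal = dcg(np.argsort(-labels))

def sample_ranking_and_grad(scores):
    # Sequentially sample a permutation and accumulate d log p(ranking) / d scores.
    remaining = list(range(len(scores)))
    order, grad = [], np.zeros_like(scores)
    for _ in range(len(scores)):
        logits = scores[remaining]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        k = rng.choice(len(remaining), p=probs)
        grad[remaining] -= probs          # -softmax over the remaining documents
        grad[remaining[k]] += 1.0         # +1 for the chosen document
        order.append(remaining.pop(k))
    return np.array(order), grad

for step in range(300):
    samples = [sample_ranking_and_grad(scores) for _ in range(16)]
    ndcgs = np.array([dcg(o) / ideal for o, _ in samples])
    baseline = ndcgs.mean()
    grad = sum((r - baseline) * g for (_, g), r in zip(samples, ndcgs)) / len(samples)
    scores += 0.5 * grad

print("learned scores:", scores.round(2))  # should roughly follow the label order
```

Because the reward depends on an entire sampled permutation, the variance of this estimator grows with the number of documents per query, which is consistent with the weakness reported in the abstract.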


2009 ◽  
Vol 5 (12) ◽  
pp. e1000586 ◽  
Author(s):  
Eleni Vasilaki ◽  
Nicolas Frémaux ◽  
Robert Urbanczik ◽  
Walter Senn ◽  
Wulfram Gerstner

Author(s):  
Sven Gronauer ◽  
Martin Gottwald ◽  
Klaus Diepold

Despite the remarkable success of recent years, the underlying mechanisms powering the advances of reinforcement learning are still poorly understood. In this paper, we identify these mechanisms, which we call ingredients, in on-policy policy gradient methods and empirically determine their impact on learning. To allow an equitable assessment, we conduct our experiments on the basis of a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not require sophisticated algorithms but can also be accomplished with a combination of a few simple ingredients.
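
To make the notion of "ingredients" concrete, here is a minimal on-policy policy gradient loop with two such ingredients (a running value baseline and advantage normalization) on a toy two-armed bandit. The task and the particular choice of ingredients are illustrative assumptions, not the paper's benchmark suite or implementation.

```python
# Hypothetical sketch: REINFORCE on a two-armed bandit with two common
# "ingredients" of on-policy policy gradient methods.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)            # softmax policy over two actions
arm_means = np.array([0.2, 0.8])
baseline, lr = 0.0, 0.1

for step in range(500):
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    actions = rng.choice(2, size=32, p=probs)
    rewards = rng.normal(arm_means[actions], 0.1)

    # Ingredient 1: a running baseline to reduce variance.
    baseline += 0.1 * (rewards.mean() - baseline)
    adv = rewards - baseline
    # Ingredient 2: advantage normalization.
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)

    # REINFORCE update: grad of log softmax is onehot(a) - probs.
    onehot = np.eye(2)[actions]
    grad = (adv[:, None] * (onehot - probs)).mean(axis=0)
    logits += lr * grad

print("action probabilities:", probs.round(3))  # should favour the better arm
```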


2020 ◽  
Vol 58 (6) ◽  
pp. 3586-3612
Author(s):  
Kaiqing Zhang ◽  
Alec Koppel ◽  
Hao Zhu ◽  
Tamer Başar
