Convergence and Iteration Complexity of Policy Gradient Method for Infinite-horizon Reinforcement Learning

Author(s):  
Kaiqing Zhang ◽  
Alec Koppel ◽  
Hao Zhu ◽  
Tamer Basar


2020 ◽
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients computed through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
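The following is a minimal sketch of the idea described in this abstract: blending a model-based k-step deterministic value gradient, obtained by differentiating through a learned dynamics model, with a model-free DDPG-style policy gradient. It is not the authors' implementation; the network sizes, rollout length `k`, and mixing weight `alpha` are illustrative assumptions.

```python
# Sketch: mix a k-step model-based value gradient with a DDPG-style gradient.
# NOT the authors' code; shapes, k, and alpha are assumptions for illustration.
import torch
import torch.nn as nn

state_dim, action_dim, k, alpha = 8, 2, 3, 0.5  # assumed sizes / hyperparameters

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
model  = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))  # learned dynamics
reward = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))          # learned reward

def k_step_value(s, gamma=0.99):
    """Roll the learned model k steps with the deterministic policy and
    differentiate the discounted return through the rollout."""
    total, discount = 0.0, 1.0
    for _ in range(k):
        a = policy(s)
        sa = torch.cat([s, a], dim=-1)
        total = total + discount * reward(sa)
        s = model(sa)                      # differentiable transition via the learned model
        discount *= gamma
    total = total + discount * critic(torch.cat([s, policy(s)], dim=-1))  # bootstrap the tail
    return total.mean()

def ddpg_value(s):
    """Model-free deterministic policy gradient surrogate: Q(s, pi(s))."""
    return critic(torch.cat([s, policy(s)], dim=-1)).mean()

s_batch = torch.randn(32, state_dim)       # stand-in for a batch of sampled states
loss = -(alpha * k_step_value(s_batch) + (1 - alpha) * ddpg_value(s_batch))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch the rollout length `k` plays the role described in the abstract: longer model rollouts reduce reliance on the critic but accumulate model bias, while `k = 0` reduces to the purely model-free DDPG update.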


Nanophotonics ◽  
2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Sean Hooten ◽  
Raymond G. Beausoleil ◽  
Thomas Van Vaerenbergh

We present a proof-of-concept technique for the inverse design of electromagnetic devices motivated by the policy gradient method in reinforcement learning, named PHORCED (PHotonic Optimization using REINFORCE Criteria for Enhanced Design). This technique uses a probabilistic generative neural network interfaced with an electromagnetic solver to assist in the design of photonic devices, such as grating couplers. We show that PHORCED obtains better performing grating coupler designs than local gradient-based inverse design via the adjoint method, while potentially providing faster convergence over competing state-of-the-art generative methods. As a further example of the benefits of this method, we implement transfer learning with PHORCED, demonstrating that a neural network trained to optimize 8° grating couplers can then be re-trained on grating couplers with alternate scattering angles while requiring >10× fewer simulations than control cases.
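As a rough illustration of the REINFORCE-style training loop this abstract describes, the sketch below couples a generative network to a black-box figure of merit via the score-function gradient. The electromagnetic solver is replaced by a toy quadratic objective, and the parameter dimension, network sizes, and baseline choice are assumptions, not the paper's actual setup.

```python
# Sketch: REINFORCE update for generative inverse design (PHORCED-like spirit).
# The EM solver is replaced by a toy black-box figure of merit; all sizes are assumed.
import torch
import torch.nn as nn

n_params = 16                                    # assumed number of design parameters
generator = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2 * n_params))

def figure_of_merit(designs):
    """Stand-in for the EM solver (e.g. coupling efficiency); treated as a
    non-differentiable black box, which is why the score-function gradient is used."""
    with torch.no_grad():
        return -((designs - 0.3) ** 2).sum(dim=-1)

opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
for step in range(100):
    out = generator(torch.ones(64, 1))           # 64 candidate designs per iteration
    mean, log_std = out.chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    designs = dist.sample()                      # sample candidate designs
    rewards = figure_of_merit(designs)
    baseline = rewards.mean()                    # variance-reducing baseline
    log_prob = dist.log_prob(designs).sum(dim=-1)
    loss = -((rewards - baseline) * log_prob).mean()   # REINFORCE surrogate objective
    opt.zero_grad(); loss.backward(); opt.step()
```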


2020 ◽  
pp. 107754632093014
Author(s):  
Xue-She Wang ◽  
James D Turner ◽  
Brian P Mann

This study describes an approach for attractor selection (or multistability control) in nonlinear dynamical systems with constrained actuation. Attractor selection is obtained using two different deep reinforcement learning methods: (1) the cross-entropy method and (2) the deep deterministic policy gradient method. The framework and algorithms for applying these control methods are presented. Experiments were performed on a Duffing oscillator, as it is a classic nonlinear dynamical system with multiple attractors. Both methods achieve attractor selection under various control constraints. Although these methods have nearly identical success rates, the deep deterministic policy gradient method has the advantages of a high learning rate, low performance variance, and a smooth control approach. This study demonstrates the ability of two reinforcement learning approaches to achieve constrained attractor selection.
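The sketch below illustrates two ingredients mentioned in this abstract: a Duffing oscillator as the controlled system and an actor whose output is squashed to respect an actuation constraint, as in a DDPG-style controller. The parameter values (damping, stiffness coefficients, force limit, time step) are illustrative assumptions only.

```python
# Sketch: Duffing oscillator environment + constrained DDPG-style actor.
# Coefficients, force limit, and time step are assumptions for illustration.
import torch
import torch.nn as nn

F_MAX, DT = 0.5, 0.01                      # assumed actuation limit and time step

def duffing_step(x, v, force, delta=0.1, alpha=-1.0, beta=1.0):
    """One explicit-Euler step of x'' + delta*x' + alpha*x + beta*x**3 = force."""
    a = force - delta * v - alpha * x - beta * x ** 3
    return x + DT * v, v + DT * a

actor = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

def act(state):
    """Constrained actuation: tanh output scaled into [-F_MAX, F_MAX]."""
    with torch.no_grad():
        return F_MAX * actor(torch.as_tensor(state, dtype=torch.float32)).item()

x, v = 1.0, 0.0                            # start near one attractor basin
for _ in range(1000):                      # roll the controlled oscillator forward
    x, v = duffing_step(x, v, act([x, v]))
```

The tanh output layer is one simple way to enforce the actuation constraint directly in the policy; training the actor and critic would then proceed with the usual DDPG replay-buffer updates.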


2020 ◽  
Vol 34 (03) ◽  
pp. 2693-2700
Author(s):  
Paul Hongsuck Seo ◽  
Piyush Sharma ◽  
Tomer Levinboim ◽  
Bohyung Han ◽  
Radu Soricut

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only used outcome of an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption ratings is several orders of magnitude less than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated by samples from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as measured both by a different set of human judges and by a separate, multi-dimensional side-by-side human evaluation procedure.
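A minimal sketch of the off-policy policy gradient idea described above appears below: rated captions sampled from a fixed proposal distribution are re-weighted by importance ratios, and instance-level human ratings serve as rewards. The captioning model is reduced to a toy token-level scorer, and the tensor shapes, rating scale, proposal scores, and clipping constant are assumptions, not the paper's configuration.

```python
# Sketch: off-policy REINFORCE with human ratings as rewards.
# Toy caption model; shapes, proposal scores, and clipping are assumptions.
import torch
import torch.nn as nn

vocab, hidden = 1000, 64
embed = nn.Embedding(vocab, hidden)
head = nn.Linear(hidden, vocab)

def caption_log_prob(captions):
    """Sum of per-token log-probabilities of rated captions under the current model."""
    logits = head(embed(captions[:, :-1]))                      # predict next tokens
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, captions[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=-1)

# Stand-ins for a batch from the caption-ratings dataset: token ids, human ratings
# in [0, 1], and log-probabilities under the distribution the captions came from.
captions = torch.randint(0, vocab, (8, 12))
ratings = torch.rand(8)
behavior_logp = caption_log_prob(captions).detach() - 1.0       # assumed proposal scores

opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-4)
logp = caption_log_prob(captions)
weights = (logp.detach() - behavior_logp).exp().clamp(max=5.0)  # clipped importance weights
loss = -(weights * ratings * logp).mean()                       # off-policy REINFORCE surrogate
opt.zero_grad(); loss.backward(); opt.step()
```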

