Reinforcement Learning based on MPC and the Stochastic Policy Gradient Method

Abstract: We present a proof-of-concept technique for the inverse design of electromagnetic devices motivated by the policy gradient method in reinforcement learning, named PHORCED (PHotonic Optimization using REINFORCE Criteria for Enhanced Design). This technique uses a probabilistic generative neural network interfaced with an electromagnetic solver to assist in the design of photonic devices, such as grating couplers. We show that PHORCED obtains better-performing grating coupler designs than local gradient-based inverse design via the adjoint method, while potentially converging faster than competing state-of-the-art generative methods. As a further example of the benefits of this method, we implement transfer learning with PHORCED, demonstrating that a neural network trained to optimize 8° grating couplers can then be re-trained on grating couplers with alternate scattering angles while requiring >10× fewer simulations than control cases.
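The REINFORCE-style update that the abstract refers to can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-dimensional Gaussian "policy" stands in for the generative neural network, and `figure_of_merit` is a hypothetical analytic stand-in for the electromagnetic solver. It is not the PHORCED implementation.

```python
import random


def figure_of_merit(x):
    # Hypothetical stand-in for a solver call: best design at x = 2.0.
    return -(x - 2.0) ** 2


def reinforce(steps=500, batch=32, lr=0.05, sigma=0.5, seed=0):
    """Score-function (REINFORCE) ascent on the mean of a Gaussian policy."""
    rng = random.Random(seed)
    mu = 0.0  # learnable policy parameter (assumed 1-D design variable)
    for _ in range(steps):
        # Sample candidate designs from the current policy.
        samples = [rng.gauss(mu, sigma) for _ in range(batch)]
        rewards = [figure_of_merit(x) for x in samples]
        # Batch-mean baseline reduces the variance of the estimate.
        baseline = sum(rewards) / batch
        # grad_mu log N(x; mu, sigma) = (x - mu) / sigma**2
        grad = sum(
            (r - baseline) * (x - mu) / sigma**2
            for x, r in zip(samples, rewards)
        ) / batch
        mu += lr * grad
    return mu


final_mu = reinforce()
print(f"learned mean: {final_mu:.3f}")
```

The update only needs reward values from the solver, not its gradients, which is the property that lets a policy gradient method wrap a black-box simulator.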


Constrained attractor selection using deep reinforcement learning

Journal of Vibration and Control ◽ 10.1177/1077546320930144 ◽ 2020 ◽ pp. 107754632093014

Author(s): Xue-She Wang ◽ James D Turner ◽ Brian P Mann

Keyword(s): Reinforcement Learning ◽ Gradient Method ◽ Nonlinear Dynamical Systems ◽ Nonlinear Dynamical System ◽ Learning Approaches ◽ Multiple Attractors ◽ Nonlinear Dynamical ◽ Cross Entropy Method ◽ Policy Gradient ◽ Attractor Selection

This study describes an approach for attractor selection (or multistability control) in nonlinear dynamical systems with constrained actuation. Attractor selection is achieved using two different deep reinforcement learning methods: (1) the cross-entropy method and (2) the deep deterministic policy gradient method. The framework and algorithms for applying these control methods are presented. Experiments were performed on a Duffing oscillator, as it is a classic nonlinear dynamical system with multiple attractors. Both methods achieve attractor selection under various control constraints. Although the two methods have nearly identical success rates, the deep deterministic policy gradient method has the advantages of a high learning rate, low performance variance, and a smooth control approach. This study demonstrates the ability of two reinforcement learning approaches to achieve constrained attractor selection.
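As a rough sketch of the first method, the cross-entropy method can be applied to a bounded feedback law on a Duffing-type system. The setup below is assumed for illustration and is not taken from the paper: an unforced double-well oscillator with equilibria at x = -1 and x = +1, a four-weight `tanh` policy, and all constants (`DELTA`, `U_MAX`, `DT`, `STEPS`) and hyperparameters are hypothetical. Whether the search finds a policy that actually switches wells depends on these choices.

```python
import math
import random

# Assumed values for illustration, not from the paper.
DELTA, U_MAX, DT, STEPS = 0.1, 0.2, 0.05, 400


def rollout(w):
    """Simulate x'' = -DELTA*x' + x - x**3 + u starting in the left well."""
    x, v = -1.0, 0.0
    for _ in range(STEPS):
        # tanh keeps |u| <= U_MAX, which models the actuation constraint.
        u = U_MAX * math.tanh(w[0] + w[1] * x + w[2] * v + w[3] * x * v)
        v += DT * (-DELTA * v + x - x**3 + u)
        x += DT * v  # semi-implicit Euler step
    # Higher is better: finish at rest in the right well (x = +1).
    return -((x - 1.0) ** 2 + v**2)


def cross_entropy_method(iters=15, pop=30, n_elite=6, seed=0):
    rng = random.Random(seed)
    dim = 4
    mean, std = [0.0] * dim, [2.0] * dim
    best_w, best_score = list(mean), rollout(mean)
    for _ in range(iters):
        # Sample a population of policy weights from the current Gaussian.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)
        ]
        scored = sorted(
            ((rollout(w), w) for w in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        if scored[0][0] > best_score:
            best_score, best_w = scored[0]
        # Refit the sampling distribution to the elite candidates.
        elite = [w for _, w in scored[:n_elite]]
        mean = [sum(w[i] for w in elite) / n_elite for i in range(dim)]
        std = [
            math.sqrt(sum((w[i] - mean[i]) ** 2 for w in elite) / n_elite) + 1e-3
            for i in range(dim)
        ]
    return best_w, best_score


best_w, best_score = cross_entropy_method()
print("best score:", best_score)
```

With `U_MAX` below the peak restoring force of the well (about 0.385 for this potential), a constant push cannot cross the barrier, so any successful policy has to pump energy over several oscillations. That is one simple way to make the actuation constraint matter in a toy problem.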


Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v34i03.5655 ◽ 2020 ◽ Vol 34 (03) ◽ pp. 2693-2700

Author(s): Paul Hongsuck Seo ◽ Piyush Sharma ◽ Tomer Levinboim ◽ Bohyung Han ◽ Radu Soricut

Keyword(s): Reinforcement Learning ◽ Gradient Method ◽ Training Data ◽ Evaluation Procedure ◽ Image Captioning ◽ Human Evaluation ◽ Policy Gradient ◽ Evaluation Dataset ◽ Image Caption

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet an expensive human rating evaluation is most often used only to produce a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the number of caption ratings is several orders of magnitude smaller than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated from samples drawn from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally on a different, multi-dimensional side-by-side human evaluation procedure.
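The off-policy setting described above can be sketched with a textbook importance-weighted policy gradient. This is an assumed, simplified estimator and not necessarily the one used in the paper: the "captioner" is a softmax over three fixed candidate captions, and the ratings and behavior distribution are hypothetical values chosen for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def off_policy_step(logits, rated, behavior_probs, lr=0.5):
    """One importance-weighted policy gradient step on softmax logits.

    rated: list of (caption_index, rating) pairs sampled from behavior_probs.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for idx, rating in rated:
        # Importance weight corrects for sampling from the rated dataset
        # rather than from the current policy.
        weight = probs[idx] / behavior_probs[idx]
        for k in range(len(logits)):
            # d log pi(idx) / d logit_k = 1[k == idx] - pi(k)
            dlog = (1.0 if k == idx else 0.0) - probs[k]
            grad[k] += weight * rating * dlog / len(rated)
    return [z + lr * g for z, g in zip(logits, grad)]


rng = random.Random(0)
ratings = [0.1, 0.9, 0.3]      # hypothetical human ratings per caption
behavior = [0.5, 0.25, 0.25]   # hypothetical distribution of rated captions
rated = [
    (i, ratings[i]) for i in rng.choices(range(3), weights=behavior, k=200)
]

logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = off_policy_step(logits, rated, behavior)
final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

In this toy case the policy shifts its probability mass toward the highest-rated caption even though that caption is not the most frequent one in the rated data, which is the effect the importance weight is meant to provide.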
