Reinforcing an Image Caption Generator Using Off-Line Human Feedback

2020, Vol. 34 (03), pp. 2693-2700
Author(s): Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim, Bohyung Han, Radu Soricut

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only outcome used from an expensive human rating evaluation is a handful of overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the number of caption ratings is several orders of magnitude smaller than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated from samples drawn from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges and, additionally, by a different, multi-dimensional side-by-side human evaluation procedure.
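To make the off-policy formulation concrete, the sketch below shows a single rating-weighted REINFORCE update of the kind described above. It is only an illustration under assumed interfaces: `caption_model.caption_logprob`, the per-caption ratings, and the stored proposal log-probabilities are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch: off-policy policy gradient with human ratings as rewards.
import torch

def off_policy_pg_step(caption_model, optimizer, batch):
    """One update over a batch of (image, rated caption, rating, proposal log-prob)."""
    loss = 0.0
    for image, caption_ids, rating, proposal_logprob in batch:
        # Log-probability of the rated caption under the current policy.
        logprob = caption_model.caption_logprob(image, caption_ids)
        # Importance weight: current policy vs. the distribution that produced
        # the rated captions. Detached so it scales, but does not propagate, the gradient.
        iw = torch.exp(logprob.detach() - proposal_logprob)
        # REINFORCE-style objective: maximize rating-weighted log-likelihood.
        loss = loss - iw * rating * logprob
    loss = loss / len(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```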

Nanophotonics, 2021, Vol. 0 (0)
Author(s): Sean Hooten, Raymond G. Beausoleil, Thomas Van Vaerenbergh

We present a proof-of-concept technique for the inverse design of electromagnetic devices motivated by the policy gradient method in reinforcement learning, named PHORCED (PHotonic Optimization using REINFORCE Criteria for Enhanced Design). This technique uses a probabilistic generative neural network interfaced with an electromagnetic solver to assist in the design of photonic devices, such as grating couplers. We show that PHORCED obtains better-performing grating coupler designs than local gradient-based inverse design via the adjoint method, while potentially providing faster convergence than competing state-of-the-art generative methods. As a further example of the benefits of this method, we implement transfer learning with PHORCED, demonstrating that a neural network trained to optimize 8° grating couplers can then be re-trained on grating couplers with alternate scattering angles while requiring >10× fewer simulations than control cases.
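The following is a minimal sketch of the REINFORCE-style loop that such a generative inverse-design method implies: a small Gaussian policy over design parameters is sampled, each sample is scored by an electromagnetic solver, and the policy is updated with a baseline-subtracted policy gradient. It is illustrative only; `simulate_coupler_efficiency` and the network shape are assumptions, not the PHORCED code.

```python
# Hedged sketch of REINFORCE-driven inverse design with a generative policy.
import torch

class DesignPolicy(torch.nn.Module):
    """Parameterizes a Gaussian distribution over design parameters."""
    def __init__(self, n_params=64):
        super().__init__()
        self.mean = torch.nn.Parameter(torch.zeros(n_params))
        self.log_std = torch.nn.Parameter(torch.zeros(n_params))

    def sample(self, n):
        dist = torch.distributions.Normal(self.mean, self.log_std.exp())
        x = dist.sample((n,))
        return x, dist.log_prob(x).sum(dim=-1)

def reinforce_step(policy, optimizer, simulate_coupler_efficiency, n_samples=16):
    designs, log_probs = policy.sample(n_samples)
    with torch.no_grad():
        # Hypothetical electromagnetic solver returning a figure of merit per design.
        rewards = torch.tensor([simulate_coupler_efficiency(d) for d in designs])
    baseline = rewards.mean()                       # variance-reduction baseline
    loss = -((rewards - baseline) * log_probs).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return rewards.max().item()
```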


Author(s): Zhan Shi, Xinchi Chen, Xipeng Qiu, Xuanjing Huang

Text generation is a crucial task in NLP. Recently, several adversarial generative models have been proposed to alleviate the exposure bias problem in text generation. Although these models have achieved great success, they still suffer from reward sparsity and mode collapse. To address these two problems, in this paper we employ inverse reinforcement learning (IRL) for text generation. Specifically, the IRL framework learns a reward function on the training data and then an optimal policy that maximizes the expected total reward. As in adversarial models, the reward function and the policy are optimized alternately. Our method has two advantages: (1) the reward function produces denser reward signals, and (2) the generation policy, trained with an entropy-regularized policy gradient, is encouraged to generate more diverse texts. Experimental results demonstrate that our proposed method generates higher-quality texts than previous methods.
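A rough sketch of the alternating optimization is given below, assuming hypothetical `policy` and `reward_net` interfaces; the entropy term is folded into the REINFORCE return as a simple approximation, not the paper's exact objective.

```python
# Hedged sketch: alternating reward-function and policy updates, IRL-style.
import torch

def irl_text_step(policy, reward_net, pol_opt, rew_opt, real_batch, tau=0.1):
    # 1) Reward update: real sentences should score higher than policy samples.
    with torch.no_grad():
        fake_batch, _ = policy.sample_sentences(len(real_batch))
    rew_loss = -(reward_net(real_batch).mean() - reward_net(fake_batch).mean())
    rew_opt.zero_grad(); rew_loss.backward(); rew_opt.step()

    # 2) Policy update: entropy-regularized policy gradient on the learned reward,
    #    which gives denser signals than a binary discriminator and favors diversity.
    samples, log_probs = policy.sample_sentences(len(real_batch))
    with torch.no_grad():
        rewards = reward_net(samples)
    pol_loss = -((rewards - tau * log_probs.detach()) * log_probs).mean()
    pol_opt.zero_grad(); pol_loss.backward(); pol_opt.step()
```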


2020, pp. 107754632093014
Author(s): Xue-She Wang, James D. Turner, Brian P. Mann

This study describes an approach for attractor selection (or multistability control) in nonlinear dynamical systems with constrained actuation. Attractor selection is obtained using two different deep reinforcement learning methods: (1) the cross-entropy method and (2) the deep deterministic policy gradient method. The framework and algorithms for applying these control methods are presented. Experiments were performed on a Duffing oscillator, as it is a classic nonlinear dynamical system with multiple attractors. Both methods achieve attractor selection under various control constraints. Although these methods have nearly identical success rates, the deep deterministic policy gradient method has the advantages of a high learning rate, low performance variance, and a smooth control approach. This study demonstrates the ability of two reinforcement learning approaches to achieve constrained attractor selection.
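As a concrete illustration of the simpler of the two methods, the snippet below implements a generic cross-entropy-method search with a clipped (constrained) actuation amplitude; `rollout_reward`, which would simulate the forced Duffing oscillator and score whether the desired attractor is reached, is a hypothetical placeholder.

```python
# Hedged sketch: cross-entropy method for constrained attractor selection.
import numpy as np

def cross_entropy_method(rollout_reward, dim, iters=50, pop=64,
                         elite_frac=0.2, force_limit=1.0):
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        # Sample candidate forcing sequences and clip to the actuation constraint.
        samples = np.clip(mean + std * np.random.randn(pop, dim),
                          -force_limit, force_limit)
        rewards = np.array([rollout_reward(s) for s in samples])
        # Keep the elite candidates and refit the sampling distribution to them.
        elite = samples[np.argsort(rewards)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```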


2020
Author(s): Dongpeng Liu

Conventional materials research and development is driven mainly by human intuition, labor, and manual decision-making, which makes it ineffective and inefficient. Due to the complexity of materials design and the magnitude of the experimental and computational work involved, discovering materials with conventional methods usually takes very long development cycles (10-20 years) with enormous labor and cost. To address this challenge, we propose a machine-learning framework called the Material Artificial Intelligence Robotics-driven System (MARS), which aims to reduce these costs with the help of machine-learning techniques. We applied advanced deep-learning networks to better predict conductivity, exploring both neural network models and tree-based models such as LightGBM. In particular, we made the models more interpretable and identified relationships between an electrolyte's composition and its ionic conductivity. To search for the optimal conductivity, we developed a deep reinforcement learning (RL) model, DDPG (Deep Deterministic Policy Gradient), to explore novel recipes that reach much higher conductivity. DDPG drives the RL process by entering new states through actions, where each action at a specific state (a one-hot vector representing the selection of electrolyte components) yields a reward Q estimated by the predictor developed in the previous step. After optimal compositions have been found for maximum conductivity, voltage stability, and modulus, new measurements are conducted to confirm these compositions. The new measurement data are then fed back to improve the prediction model, so the prediction model is constantly updated by each RL prediction; once a successful update has been made, the whole process iterates. A well-trained DDPG model combines the benefits of both Q-learning and policy gradient methods: it is faster, simpler, more robust, and able to achieve much higher conductivity than conventional search methods. The model can also propose compositions that lead to conductivities higher than the highest conductivity in the training data, and we generated more training data from these compositions to retrain the prediction model. The generated recipes have been validated by both machine-learning metrics and wet-lab experiments, and the best generated conductivity (2.51 × 10⁻³) met our expectations for battery recipes.
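For reference, a schematic DDPG actor-critic update of the kind used in such a recipe search is sketched below; the network, optimizer, and batch interfaces are assumptions rather than the MARS implementation.

```python
# Hedged sketch: one DDPG update step (critic regression, actor ascent, target averaging).
import torch

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    state, action, reward, next_state = batch
    # Critic: regress Q(s, a) toward the bootstrapped target reward + gamma * Q'.
    with torch.no_grad():
        target_q = reward + gamma * target_critic(next_state, target_actor(next_state))
    critic_loss = torch.nn.functional.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, ascend Q along the actor's action.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak-average the target networks toward the online networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```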


2020, Vol. 2020, pp. 1-12
Author(s): Junta Wu, Huiyun Li

The deep deterministic policy gradient algorithm, which operates over continuous action spaces, has attracted great attention in reinforcement learning. However, its exploration strategy via dynamic programming within the Bayesian belief state space is rather inefficient even for simple systems. Another problem is that training data collected with autonomous vehicles is sequential and causally dependent, which violates the i.i.d. (independent and identically distributed) assumption on training samples. This usually causes the standard bootstrap to fail when learning an optimal policy. In this paper, we propose an m-out-of-n bootstrapped and aggregated multiple deep deterministic policy gradient framework to accelerate the training process and increase performance. Experimental results on a 2D robot arm game show that the reward gained by the aggregated policy is 10%–50% higher than those gained by the sub-policies. Experimental results on the open racing car simulator (TORCS) demonstrate that the new algorithm can learn successful control policies with 56.7% less training time. A convergence analysis is also given from the perspective of probability and statistics. These results verify that the proposed method outperforms existing algorithms in both efficiency and performance.
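A compact sketch of the bootstrapping-and-aggregation idea is shown below, with hypothetical sub-agent and replay-buffer interfaces; it only illustrates m-out-of-n resampling per sub-agent and action averaging at deployment, not the paper's full algorithm.

```python
# Hedged sketch: m-out-of-n bootstrapped training and policy aggregation.
import random
import numpy as np

def bootstrap_sample(replay_buffer, m):
    """Draw an m-out-of-n bootstrap resample (with replacement) of stored transitions."""
    return [random.choice(replay_buffer) for _ in range(m)]

def aggregated_action(sub_policies, state):
    """Aggregate sub-policies by averaging their deterministic actions."""
    return np.mean([pi.act(state) for pi in sub_policies], axis=0)

def train_bootstrapped_ddpg(sub_agents, replay_buffer, m, steps=1000):
    for _ in range(steps):
        for agent in sub_agents:
            # Each sub-agent learns from its own resample, weakening the
            # correlation of the sequential driving data it was collected from.
            agent.update(bootstrap_sample(replay_buffer, m))
    return sub_agents
```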


Author(s): Emmanuel Ifeanyi Iroegbu, Devaraj Madhavi

Deep reinforcement learning has been successful in solving common autonomous driving tasks such as lane-keeping by simply using pixel data from the front-view camera as input. However, raw pixel data constitutes a very high-dimensional observation that degrades the learning quality of the agent due to the complexity imposed by a 'realistic' urban environment. We therefore investigate how compressing the raw pixel data from a high-dimensional state to a low-dimensional latent space offline, using a variational autoencoder, can significantly improve the training of a deep reinforcement learning agent. We evaluated our method on a simulated autonomous vehicle in CARLA (Car Learning to Act) and compared our results with several baselines, including deep deterministic policy gradient, proximal policy optimization, and soft actor-critic. The results show that the method greatly accelerates training and yields a remarkable improvement in the quality of the deep reinforcement learning agent.
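A minimal sketch of this two-stage setup, in which a pre-trained variational autoencoder encoder compresses each camera frame before the policy acts, is given below; `vae`, `agent`, and the gym-style `env` are hypothetical stand-ins for a CARLA-like environment.

```python
# Hedged sketch: drive one episode on VAE-compressed latent observations.
import torch

def drive_episode(env, vae, agent, max_steps=1000):
    obs = env.reset()                               # raw front-camera pixels
    total_reward = 0.0
    for _ in range(max_steps):
        with torch.no_grad():
            frame = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            mu, _ = vae.encode(frame)               # low-dimensional latent mean
        action = agent.act(mu.squeeze(0))           # policy sees the latent, not pixels
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```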

