Deep Reinforcement Learning via Past-Success Directed Exploration

Author(s):  
Xiaoming Liu ◽  
Zhixiong Xu ◽  
Lei Cao ◽  
Xiliang Chen ◽  
Kai Kang

The balance between exploration and exploitation has always been a core challenge in reinforcement learning. This paper proposes a "past-success exploration strategy combined with Softmax action selection" (PSE-Softmax), an adaptive control method that exploits characteristics of the agent's online learning process to adjust exploration parameters dynamically. The proposed strategy is tested on OpenAI Gym with discrete and continuous control tasks, and the experimental results show that the PSE-Softmax strategy delivers better performance than deep reinforcement learning algorithms with basic exploration strategies.
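
The abstract does not reproduce the update rule, so the following is a minimal Python sketch of the general idea: Boltzmann (Softmax) action selection whose temperature shrinks as recent performance improves. The temperature schedule, the `target` return, and the decay constants are illustrative assumptions, not the paper's PSE rule.

```python
import numpy as np

def softmax_action(q_values, temperature):
    """Boltzmann (Softmax) action selection over estimated action values."""
    prefs = np.asarray(q_values, dtype=float) / max(temperature, 1e-8)
    prefs -= prefs.max()                           # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

def adapt_temperature(recent_returns, t_min=0.05, t_max=1.0, target=200.0):
    """Illustrative adaptation: shrink the temperature (explore less)
    as recent episode returns approach an assumed target score."""
    if not recent_returns:
        return t_max
    success = np.clip(np.mean(recent_returns) / target, 0.0, 1.0)
    return t_max - (t_max - t_min) * success

# usage: pick an action for one state given its Q-value estimates
q = [1.2, 0.7, 0.9]
temp = adapt_temperature(recent_returns=[180.0, 150.0, 195.0])
a = softmax_action(q, temp)
```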

2014 ◽  
Vol 571-572 ◽  
pp. 105-108
Author(s):  
Lin Xu

This paper proposes a new framework that combines reinforcement learning with a cloud computing digital library. Unified self-learning algorithms, which include reinforcement learning, artificial intelligence, and related techniques, have led to many essential advances. Given the current status of highly available models, analysts urgently desire the deployment of write-ahead logging. In this paper we examine how DNS can be applied to the investigation of superblocks, and introduce reinforcement learning to improve the quality of the current cloud computing digital library. The experimental results show that the method works more efficiently.


2020 ◽  
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite horizon. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
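
As a rough illustration of the final step, combining the model-based deterministic value gradient with the model-free deterministic policy gradient, the sketch below blends two gradient estimates with a fixed weight. The weight `alpha` and the gradient values are placeholders; the paper's DVPG defines its own combination and rollout-step schedule.

```python
import numpy as np

def blended_policy_gradient(model_based_grad, model_free_grad, alpha):
    """Weighted combination of a model-based deterministic value gradient
    (computed by rolling the learned model forward k steps; larger k trades
    higher variance for lower model bias) and a model-free deterministic
    policy gradient as in DDPG.  alpha is an illustrative mixing weight."""
    return alpha * np.asarray(model_based_grad) + (1.0 - alpha) * np.asarray(model_free_grad)

# usage: two gradient estimates for the same policy parameters
g_model = np.array([0.12, -0.03, 0.40])   # from short rollouts of a learned model
g_free  = np.array([0.10,  0.01, 0.35])   # from the critic, DDPG-style
update_direction = blended_policy_gradient(g_model, g_free, alpha=0.3)
```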


2012 ◽  
Vol 182-183 ◽  
pp. 427-430
Author(s):  
Li Feng Wei ◽  
Liang Cheng ◽  
Xing Man Yang

An adaptive control method for the pulse demagnetizer is presented, which automatically adjusts the strength of the charge current according to changes in the magnetic content so that the magnetic field remains constant. The experimental results show that, compared with conventional demagnetizers, it offers low power consumption, strong anti-interference capability, stable and reliable operation, long service life, and a good demagnetizing effect.
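
A minimal sketch of the kind of feedback adjustment described, assuming a simple proportional correction of the charge current toward a target field; the gain and current limits are illustrative, not values from the paper.

```python
def adjust_charge_current(current, field_measured, field_target,
                          gain=0.1, i_min=0.0, i_max=10.0):
    """Illustrative proportional correction: raise the charge current when the
    measured magnetic field falls below the target and lower it when above,
    clamped to assumed hardware limits."""
    error = field_target - field_measured
    return min(max(current + gain * error, i_min), i_max)

# usage: one control step with placeholder readings
new_current = adjust_charge_current(current=5.0, field_measured=0.8, field_target=1.0)
```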


Author(s):  
Richard Cheng ◽  
Gábor Orosz ◽  
Richard M. Murray ◽  
Joel W. Burdick

Reinforcement Learning (RL) algorithms have found limited success beyond simulated applications, and one main reason is the absence of safety guarantees during the learning process. Real-world systems would realistically fail or break before an optimal controller can be learned. To address this issue, we propose a controller architecture that combines (1) a model-free RL-based controller with (2) model-based controllers utilizing control barrier functions (CBFs) and (3) online learning of the unknown system dynamics, in order to ensure safety during learning. Our general framework leverages the success of RL algorithms to learn high-performance controllers, while the CBF-based controllers both guarantee safety and guide the learning process by constraining the set of explorable policies. We utilize Gaussian Processes (GPs) to model the system dynamics and its uncertainties. Our novel controller synthesis algorithm, RL-CBF, guarantees safety with high probability during the learning process, regardless of the RL algorithm used, and demonstrates greater policy exploration efficiency. We test our algorithm on (1) control of an inverted pendulum and (2) autonomous car-following with wireless vehicle-to-vehicle communication, and show that our algorithm attains much greater sample efficiency in learning than other state-of-the-art algorithms and maintains safety during the entire learning process.
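
A toy illustration of the CBF safety-filter idea for a one-dimensional system; the real RL-CBF method solves an optimization over GP-modelled dynamics, so the dynamics, barrier, and clipping used here are assumptions for exposition only.

```python
def cbf_filter(u_rl, x, x_max=1.0, gamma=2.0):
    """Minimal CBF safety filter for the toy system x_dot = u with the
    barrier h(x) = x_max - x >= 0.  The CBF condition h_dot + gamma*h >= 0
    reduces to u <= gamma * (x_max - x), so the RL action is minimally
    modified (here, clipped) to satisfy it."""
    u_safe_max = gamma * (x_max - x)
    return min(u_rl, u_safe_max)

# usage: the RL policy proposes a large action near the safety boundary
u_safe = cbf_filter(u_rl=1.5, x=0.9)   # filtered down to 0.2
```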


2006 ◽  
Vol 04 (06) ◽  
pp. 1071-1083 ◽  
Author(s):  
C. L. CHEN ◽  
D. Y. DONG ◽  
Z. H. CHEN

This paper proposes a novel action selection method based on quantum computation and reinforcement learning (RL). Inspired by the advantages of quantum computation, the state/action in an RL system is represented as a quantum superposition state. The probability of each action eigenvalue is denoted by a probability amplitude, which is updated according to rewards, and action selection is carried out by observing the quantum state according to the collapse postulate of quantum measurement. The results of simulated experiments show that quantum computation can be effectively applied to action selection and decision making by speeding up learning. This method also makes a good tradeoff between exploration and exploitation for RL using the probability characteristics of quantum theory.
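
A small Python sketch of the amplitude-based selection idea: the selection probability equals the squared amplitude, and amplitudes of rewarded actions are amplified and renormalised. The update rule here is an illustrative stand-in for the paper's amplitude-amplification scheme.

```python
import numpy as np

class QuantumInspiredSelector:
    """Each action carries a probability amplitude; selection samples an
    action with probability equal to the squared amplitude (an analogue of
    the collapse postulate), and rewarded actions are amplified."""

    def __init__(self, n_actions):
        self.amp = np.ones(n_actions) / np.sqrt(n_actions)  # uniform superposition

    def select(self):
        probs = self.amp ** 2
        return np.random.choice(len(self.amp), p=probs / probs.sum())

    def update(self, action, reward, lr=0.1):
        self.amp[action] *= (1.0 + lr * reward)              # amplify if rewarded
        self.amp /= np.linalg.norm(self.amp)                 # keep unit norm

# usage
sel = QuantumInspiredSelector(n_actions=4)
a = sel.select()
sel.update(a, reward=1.0)
```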


2020 ◽  
Vol 17 (2) ◽  
pp. 172988142091995 ◽  
Author(s):  
Yushan Sun ◽  
Xiangrui Ran ◽  
Jian Cao ◽  
Yueming Li

In view of the difficulties in determining the attitude of a wrecked submarine and in automatic attitude matching of deep submergence rescue vehicles during docking and guidance, this study proposes a docking method based on parameter adaptive control with acoustic and visual guidance. This method omits the process of obtaining information about the wrecked submarine in advance, thus saving considerable detection time and improving rescue efficiency. A parameter adaptive controller based on reinforcement learning is designed: the S-plane and proportional-integral-derivative controllers are trained through reinforcement learning to obtain their control parameters, improving the environmental adaptability and anti-current ability of deep submergence rescue vehicles. The effectiveness of the proposed method is demonstrated by simulation and pool tests. The comparison experiment shows that the parameter adaptive controller based on reinforcement learning achieves better control performance, accuracy, and stability than the untrained control method.
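
As a sketch of the controller being tuned, the following discrete PID implementation takes its gains from an external tuner; the hard-coded gains stand in for the RL-learned parameters and it is not the paper's S-plane controller.

```python
class PID:
    """Discrete PID controller whose gains are supplied externally, e.g. by a
    reinforcement-learning-based parameter tuner as described above."""

    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# usage: gains produced by the (hypothetical) RL tuner
controller = PID(kp=2.0, ki=0.5, kd=0.1)
u = controller.step(setpoint=1.0, measurement=0.8)
```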


2019 ◽  
Vol 63 (7) ◽  
pp. 995-1003
Author(s):  
Z Xu ◽  
L Cao ◽  
X Chen

Simple and efficient exploration remains a core challenge in deep reinforcement learning. While many exploration methods can be applied to high-dimensional tasks, these methods manually adjust exploration parameters according to domain knowledge. This paper proposes a novel method that automatically balances exploration and exploitation, and combines on-policy and off-policy update targets through dynamic weighting based on the value difference. The proposed method does not directly affect the probability of a selected action; instead, it uses the value difference produced during the learning process to adjust the update target, thereby guiding the direction of the agent's learning. We demonstrate the performance of the proposed method on the CartPole-v1, MountainCar-v0, and LunarLander-v2 classic control tasks from the OpenAI Gym. Empirical evaluation results show that, by integrating on-policy and off-policy update targets dynamically, this method achieves better performance and stability than the exclusive use of either update target.
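
A minimal tabular sketch of mixing on-policy (Sarsa) and off-policy (Q-learning) update targets; the weighting based on their difference is one illustrative choice, not the paper's dynamic scheme.

```python
import numpy as np

def mixed_update_target(q, s_next, a_next, reward, gamma=0.99):
    """Blend the on-policy (Sarsa) target with the off-policy (Q-learning)
    target.  The difference-based weight below is illustrative only."""
    on_policy = reward + gamma * q[s_next, a_next]   # Sarsa target
    off_policy = reward + gamma * q[s_next].max()    # Q-learning target
    diff = abs(off_policy - on_policy)
    beta = 1.0 / (1.0 + diff)                        # larger gap -> lean off-policy
    return beta * on_policy + (1.0 - beta) * off_policy

# usage with a small tabular value function
Q = np.zeros((5, 2))
target = mixed_update_target(Q, s_next=3, a_next=1, reward=1.0)
```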


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0250040
Author(s):  
Nicola Milano ◽  
Stefano Nolfi

The efficacy of evolutionary or reinforcement learning algorithms for continuous control optimization can be enhanced by including an additional neural network dedicated to feature extraction, trained through self-supervision. In this paper we introduce a method that allows the feature-extraction network to continue training while the control network is trained. We demonstrate that the parallel training of the two networks is crucial for agents that operate on the basis of egocentric observations, and that feature extraction also provides an advantage in problems that do not benefit from dimensionality reduction. Finally, we compare different feature-extraction methods and show that sequence-to-sequence learning outperforms the alternative methods considered in previous studies.
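
A compact PyTorch sketch of the parallel-training idea: a feature-extraction network optimised with a self-supervised reconstruction loss alongside a control network that consumes its features. The reconstruction objective, layer sizes, and single joint optimiser are assumptions; the paper's self-supervision is sequence-to-sequence.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim, act_dim = 12, 4, 2
encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.Tanh())   # feature extractor
decoder = nn.Linear(feat_dim, obs_dim)                              # self-supervised head
policy = nn.Linear(feat_dim, act_dim)                               # control network

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()) + list(policy.parameters()),
    lr=1e-3,
)

obs = torch.randn(32, obs_dim)                # stand-in observation batch
feats = encoder(obs)
recon_loss = nn.functional.mse_loss(decoder(feats), obs)
# placeholder control objective: a real agent would use an RL or evolutionary signal
control_loss = policy(feats).pow(2).mean()

opt.zero_grad()
(recon_loss + control_loss).backward()        # both networks updated in parallel
opt.step()
```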

