Policy Direct Search for Effective Reinforcement Learning

2021
Author(s): Yiming Peng
Reinforcement Learning (RL) problems appear in diverse real-world applications and are gaining substantial attention in academia and industry. Policy Direct Search (PDS) is widely recognized as an effective approach to RL problems. However, existing PDS algorithms have several major limitations. First, many step-wise Policy Gradient Search (PGS) algorithms cannot effectively utilize informative historical gradients to estimate policy gradients accurately. Second, although evolutionary PDS algorithms do not rely on accurate policy gradient estimations and can explore learning environments effectively, they are not sample efficient at learning policies in the form of deep neural networks. Third, existing PGS algorithms often diverge easily due to the lack of reliable and flexible techniques for value function learning. Fourth, existing PGS algorithms have not provided suitable mechanisms to learn proper state features automatically.

To address these limitations, the overall goal of this thesis is to develop effective policy direct search algorithms for tackling challenging RL problems through technical innovations in four key areas. First, the thesis aims to improve the accuracy of policy gradient estimation by utilizing historical gradients through a Primal-Dual Approximation technique. Second, the thesis aims to surpass state-of-the-art performance by properly balancing the exploration-exploitation trade-off via the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and Proximal Policy Optimization (PPO). Third, the thesis seeks to stabilize value function learning via a self-organized Sandpile Model (SM) and, at the same time, to generalize the compatible condition to support flexible value function learning. Fourth, the thesis endeavors to develop innovative evolutionary feature learning techniques that automatically extract useful state features so as to enhance various cutting-edge PGS algorithms.

In the thesis, we explore the four key technical areas by studying policies of increasing complexity. We start from a simple linear policy representation and then proceed to a complex neural-network-based policy representation. Next, we consider a more complicated situation where policy learning is coupled with value function learning. Subsequently, we consider policies modeled as a concatenation of two interrelated networks, one for feature learning and one for action selection.

To achieve the first goal, this thesis proposes a new policy gradient learning framework in which a series of historical gradients are jointly exploited to obtain accurate policy gradient estimations via the Primal-Dual Approximation technique. Under this framework, three new PGS algorithms for step-wise policy training are derived from three widely used PGS algorithms, and the convergence properties of the new algorithms are analyzed theoretically. Empirical results on several benchmark control problems further show that the newly proposed algorithms can significantly outperform their base algorithms.

To achieve the second goal, this thesis develops a new sample-efficient evolutionary deep policy optimization algorithm based on CMA-ES and PPO. The algorithm uses a layer-wise learning mechanism to improve computational efficiency in comparison to CMA-ES. Additionally, it employs a surrogate model based on a performance lower bound for fitness evaluation, significantly reducing the sample cost to the state-of-the-art level. More importantly, the best policy found by CMA-ES at every generation is further improved by PPO to properly balance exploration and exploitation. The experimental results confirm that the proposed algorithm outperforms various cutting-edge algorithms on many benchmark continuous control problems.

To achieve the third goal, this thesis develops new value function learning methods that are both reliable and flexible so as to further enhance the effectiveness of policy gradient search. Two Actor-Critic (AC) algorithms are developed from a commonly used PGS algorithm, namely Regular Actor-Critic (RAC). The first algorithm adopts SM to stabilize value function learning, and the second generalizes the logarithm function used by the compatible condition to provide a flexible family of new compatible functions. The experimental results show that, with the help of reliable and flexible value function learning, the newly developed algorithms are more effective than RAC on several benchmark control problems.

To achieve the fourth goal, this thesis develops innovative NeuroEvolution algorithms for automated feature learning to enhance various cutting-edge PGS algorithms. The newly developed algorithms not only extract useful state features but also learn good policies. The experimental analysis demonstrates that they achieve better performance on large-scale RL problems than both well-known PGS algorithms and NeuroEvolution techniques. Our experiments also confirm that the state features learned by NeuroEvolution on one RL task can be easily transferred to boost learning performance on similar but different tasks.
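The hybrid scheme sketched in this abstract, an outer evolutionary search whose best candidate is refined each generation by a gradient-based step, can be illustrated with a minimal toy sketch. The code below uses a simplified diagonal-noise evolution strategy rather than full CMA-ES, a finite-difference ascent step as a stand-in for PPO, and a synthetic fitness function; it shows the structure only and is not the thesis's actual algorithm.

import numpy as np

rng = np.random.default_rng(0)

def episode_return(theta):
    # Toy stand-in for a policy rollout; real code would run episodes in an environment.
    return -np.sum((theta - 1.0) ** 2)

def gradient_refine(theta, lr=0.05, eps=1e-2):
    # Placeholder for a PPO-style local improvement: finite-difference ascent
    # on the toy objective, NOT the actual PPO clipped-surrogate update.
    grad = np.array([
        (episode_return(theta + eps * e) - episode_return(theta - eps * e)) / (2 * eps)
        for e in np.eye(theta.size)
    ])
    return theta + lr * grad

dim, popsize, elite = 8, 16, 4
mean, sigma = np.zeros(dim), 1.0
for gen in range(50):
    pop = mean + sigma * rng.standard_normal((popsize, dim))   # sample a generation
    fitness = np.array([episode_return(p) for p in pop])       # (surrogate) fitness evaluation
    best = pop[np.argsort(fitness)[::-1][:elite]]              # keep the elites
    refined = gradient_refine(best[0])                         # refine the generation's best ("improved by PPO")
    mean = 0.5 * refined + 0.5 * best.mean(axis=0)             # recombine elites with the refined best
    sigma *= 0.97                                              # simple step-size decay (no covariance adaptation here)
print("final return:", episode_return(mean))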


2006
Vol 2006
pp. 1-23
Author(s): Seïd Bahlali, Brahim Mezerdi, Boualem Djehiche

We consider a control problem where the state variable is a solution of a stochastic differential equation (SDE) in which the control enters both the drift and the diffusion coefficient. We study the relaxed problem, for which admissible controls are measure-valued processes and the state variable is governed by an SDE driven by an orthogonal martingale measure. Under some mild conditions on the coefficients and pathwise uniqueness, we prove that every diffusion process associated to a relaxed control is a strong limit of a sequence of diffusion processes associated to strict controls. As a consequence, we show that the strict and the relaxed control problems have the same value function and that an optimal relaxed control exists. Moreover, we derive a maximum principle of the Pontryagin type, extending the well-known Peng stochastic maximum principle to the class of measure-valued controls.
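Schematically, with generic drift $b$, diffusion $\sigma$, and control space $U$ assumed here for illustration, the strict and relaxed dynamics compared in the paper take the form
\[
dX_t = b(t, X_t, u_t)\,dt + \sigma(t, X_t, u_t)\,dW_t
\qquad\text{versus}\qquad
dX_t = \int_U b(t, X_t, a)\,q_t(da)\,dt + \int_U \sigma(t, X_t, a)\,M(da, dt),
\]
where $u_t$ is a $U$-valued strict control, $q_t$ is a measure-valued (relaxed) control, and $M$ is an orthogonal martingale measure with intensity $q_t(da)\,dt$.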


2020
Vol 26 (2)
pp. 131-161
Author(s): Florian Bourgey, Stefano De Marco, Emmanuel Gobet, Alexandre Zhou

Abstract The multilevel Monte Carlo (MLMC) method developed by M. B. Giles [Multilevel Monte Carlo path simulation, Oper. Res. 56 (2008), no. 3, 607–617] has a natural application to the evaluation of nested expectations $\mathbb{E}[g(\mathbb{E}[f(X,Y)\mid X])]$, where $f,g$ are functions and $(X,Y)$ is a pair of independent random variables. Apart from the pricing of American-type derivatives, such computations arise in a large variety of risk valuations (VaR or CVaR of a portfolio, CVA) and in the assessment of margin costs for centrally cleared portfolios. In this work, we focus on the computation of initial margin. We analyze the properties of the corresponding MLMC estimators, for which we provide asymptotic optimality results; at the technical level, we have to deal with the limited regularity of the outer function $g$ (which may fail to be everywhere differentiable). In parallel, we investigate upper and lower bounds for nested expectations as above, in the spirit of primal-dual algorithms for stochastic control problems.
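To make the nested structure concrete, the following minimal sketch combines a plain nested Monte Carlo estimator of $\mathbb{E}[g(\mathbb{E}[f(X,Y)\mid X])]$ with a single MLMC-style level correction that couples a fine and a coarse inner estimator on shared samples. The functions f and g, the distributions, and the sample sizes are toy choices for illustration, not those analyzed in the paper.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x, y: np.maximum(x + y, 0.0)   # toy inner payoff
g = lambda z: np.maximum(z - 0.5, 0.0)    # toy outer function (kinked, hence only piecewise differentiable)

def inner_mean(x, m):
    # Estimate E[f(X, Y) | X = x] with m inner samples.
    y = rng.standard_normal(m)
    return f(x, y).mean()

def nested_mc(n_outer, m_inner):
    # Level-0 estimator: coarse inner sample size throughout.
    x = rng.standard_normal(n_outer)
    return np.mean([g(inner_mean(xi, m_inner)) for xi in x])

def level_correction(n_outer, m_fine):
    # MLMC-style difference between a fine (m_fine) and a coarse (m_fine // 2)
    # inner estimator built from the SAME inner samples, so the difference has
    # small variance while correcting the coarse level's bias.
    m_coarse = m_fine // 2
    x = rng.standard_normal(n_outer)
    diffs = []
    for xi in x:
        y = rng.standard_normal(m_fine)
        fine = f(xi, y).mean()
        coarse = f(xi, y[:m_coarse]).mean()
        diffs.append(g(fine) - g(coarse))
    return np.mean(diffs)

estimate = nested_mc(2000, 4) + level_correction(2000, 8)   # two-level telescoping sum
print("two-level estimate:", estimate)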


2016
Vol 8 (6)
pp. 1050-1071
Author(s): Tianliang Hou, Li Li

Abstract In this paper, we investigate the error estimates of mixed finite element methods for optimal control problems governed by general elliptic equations. The state and co-state are approximated by the lowest-order Raviart-Thomas mixed finite element spaces, and the control variable is approximated by piecewise constant functions. We derive $L^2$- and $H^{-1}$-error estimates for both the control variable and the state variables. Finally, a numerical example is given to demonstrate the theoretical results.
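A representative model problem of this type, written here in generic notation (the paper's exact cost functional, coefficients, and admissible set may differ), is the distributed control of an elliptic equation in mixed form:
\[
\min_{u \in U_{ad}}\ \tfrac12\|\mathbf{p}-\mathbf{p}_d\|^2 + \tfrac12\|y-y_d\|^2 + \tfrac12\|u\|^2
\quad\text{subject to}\quad
\operatorname{div}\mathbf{p} = f + u,\quad \mathbf{p} = -A(x)\nabla y \ \text{in } \Omega,\quad y = 0 \ \text{on } \partial\Omega,
\]
with the state pair $(y,\mathbf{p})$ discretized by the lowest-order Raviart-Thomas elements and the control $u$ by piecewise constants, as described in the abstract above.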


Author(s): Alekos Cecchin

We examine mean field control problems on a finite state space, in continuous time and over a finite time horizon. We characterize the value function of the mean field control problem as the unique viscosity solution of a Hamilton-Jacobi-Bellman equation in the simplex. In the absence of any convexity assumption, we exploit this characterization to prove convergence, as $N$ grows, of the value functions of the centralized $N$-agent optimal control problem to the value function of the limit mean field control problem, with a convergence rate of order $\frac{1}{\sqrt{N}}$. Then, assuming convexity, we show that the limit value function is smooth and establish propagation of chaos, i.e., convergence of the $N$-agent optimal trajectories to the unique limiting optimal trajectory, with an explicit rate.
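In schematic form (with notation chosen here for illustration rather than taken verbatim from the paper), the first convergence result reads
\[
\sup_{t\in[0,T]}\;\Big| V^N(t, x_1,\dots,x_N) - V\big(t, \mu^N_{\boldsymbol{x}}\big) \Big| \;\le\; \frac{C}{\sqrt{N}},
\qquad \mu^N_{\boldsymbol{x}} := \frac{1}{N}\sum_{i=1}^N \delta_{x_i},
\]
where $V^N$ is the value function of the centralized $N$-agent problem, $V$ is the mean field value function (the viscosity solution of the Hamilton-Jacobi-Bellman equation in the simplex), and $\mu^N_{\boldsymbol{x}}$ is the empirical measure of the agents' states.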


Author(s): Jean Walrand

Abstract There is a class of control problems that admit a particularly elegant solution: the linear quadratic Gaussian (LQG) problems. In these problems, the state dynamics and observations are linear, the cost is quadratic, and the noise is Gaussian. Section 14.1 explains the theory of LQG problems when one observes the state. Section 14.2 discusses the situation when the observations are noisy and shows the remarkable certainty equivalence property of the solution. Section 14.3 explains how noisy observations affect Markov decision problems.
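As an illustration of the full-state-observation case, the sketch below implements the standard finite-horizon, discrete-time LQR backward Riccati recursion on a toy system; the matrices and horizon are arbitrary assumptions, and the chapter itself may use a different parameterization. The resulting feedback gains do not depend on the Gaussian noise covariance, which is the property underlying certainty equivalence when the state must instead be estimated from noisy observations.

import numpy as np

def lqr_gains(A, B, Q, R, Qf, T):
    # Backward Riccati recursion for dynamics x_{t+1} = A x_t + B u_t + w_t
    # and cost sum_t (x_t' Q x_t + u_t' R u_t) + x_T' Qf x_T.
    P = Qf
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal feedback: u_t = -K_t x_t
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]   # K_0, ..., K_{T-1}

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # toy double-integrator dynamics
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[0.1]]); Qf = 10 * np.eye(2)
K = lqr_gains(A, B, Q, R, Qf, T=20)
print("first-step feedback gain K_0 =", K[0])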


2009
Vol 13
pp. 111-123
Author(s): Esteve Juanola-Feliu

Abstract This paper analyses the state of the art for nanotechnology in Barcelona, focussing on the scientific and economic challenges arising from nanotechnologies and the creative and innovative framework in Barcelona that could be used to meet them. Nanotechnology is an endless source of innovation and creativity at the intersection of medicine, biotechnology, engineering, physical sciences and information technology, and it is opening up new directions in R&D, knowledge management and technology transfer. Given the huge economic investment and cutting-edge research in the field of nanotechnology, creatively managed, cooperation-based university-industry collaboration is in greater demand than ever before.


Sensors
2020
Vol 20 (18)
pp. 5443
Author(s): Hongyu Hu, Ziyang Lu, Qi Wang, Chengyuan Zheng

Changing lanes while driving requires coordinating the lateral and longitudinal controls of a vehicle, considering its running state and the surrounding environment. Although existing rule-based automated lane-changing methods are simple, they are unsuitable for the unpredictable scenarios encountered in practice. Therefore, using a deep deterministic policy gradient (DDPG) algorithm, we propose an end-to-end method for automated lane changing based on lidar data. The distance state information about the lane boundary and the surrounding vehicles obtained by the agent in a simulation environment is taken as the state space for the reinforcement-learning-based automated lane-change problem. The steering wheel angle and longitudinal acceleration are used as the action space, and both the state and action spaces are continuous. In terms of the reward function, avoiding collisions and setting different expected lane-changing distances that represent different driving styles are considered for safety, the angular velocity of the steering wheel and the jerk are considered for comfort, and a minimum lane-changing speed together with encouraging the agent to complete the lane change quickly is considered for efficiency. For a one-way two-lane road, a visual simulation environment is constructed using Pyglet. By comparing the lane-changing trajectories of two driving styles in a simplified traffic flow scene, we study the influence of driving style on the lane-changing process and lane-changing time. Through training and tuning of the combined lateral and longitudinal control of autonomous vehicles with different driving styles in complex traffic scenes, the vehicles can complete a series of driving tasks while reflecting driving-style differences. The experimental results show that autonomous vehicles can reflect differences in driving style when changing lanes at the same speed. Under the combined lateral and longitudinal control, the autonomous vehicles exhibit good robustness to different speeds and traffic densities in different road sections. Thus, autonomous vehicles trained using the proposed method can learn an automated lane-changing policy that accounts for safety, comfort, and efficiency.
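The three-part reward described above (safety, comfort, efficiency) can be sketched as a single shaping function; the weights, thresholds, and field names below are illustrative assumptions, not the paper's actual design or values.

def lane_change_reward(s, collision, style_target_dist=30.0,
                       w_safe=10.0, w_dist=0.5, w_steer=0.1, w_jerk=0.1,
                       w_speed=0.2, v_min=10.0):
    # s: dict with keys 'lane_change_dist', 'steer_rate', 'jerk', 'speed' (hypothetical names).
    if collision:                          # safety: large penalty on collision
        return -w_safe
    r = 0.0
    # style/safety: penalize deviation from the expected lane-changing distance,
    # where style_target_dist differs between driving styles
    r -= w_dist * abs(s['lane_change_dist'] - style_target_dist) / style_target_dist
    # comfort: penalize steering-wheel angular velocity and longitudinal jerk
    r -= w_steer * abs(s['steer_rate']) + w_jerk * abs(s['jerk'])
    # efficiency: penalize dropping below the minimum lane-changing speed
    if s['speed'] < v_min:
        r -= w_speed * (v_min - s['speed'])
    return r

Varying style_target_dist (and the comfort weights) is one simple way to encode the different driving styles compared in the experiments.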

