Approximation and optimality necessary conditions in relaxed stochastic control problems

2006 ◽  
Vol 2006 ◽  
pp. 1-23 ◽  
Author(s):  
Seïd Bahlali ◽  
Brahim Mezerdi ◽  
Boualem Djehiche

We consider a control problem where the state variable is a solution of a stochastic differential equation (SDE) in which the control enters both the drift and the diffusion coefficient. We study the relaxed problem, for which admissible controls are measure-valued processes and the state variable is governed by an SDE driven by an orthogonal martingale measure. Under some mild conditions on the coefficients and pathwise uniqueness, we prove that every diffusion process associated with a relaxed control is a strong limit of a sequence of diffusion processes associated with strict controls. As a consequence, we show that the strict and the relaxed control problems have the same value function and that an optimal relaxed control exists. Moreover, we derive a maximum principle of Pontryagin type, extending the well-known Peng stochastic maximum principle to the class of measure-valued controls.
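For orientation, here is a minimal sketch of the two formulations the abstract compares; the coefficient names b and σ, the control set U, the Brownian motion W, and the martingale measure M are assumed notation for illustration, not taken from the paper.

```latex
% Strict control: u_t is a U-valued process
dX_t = b(t, X_t, u_t)\,dt + \sigma(t, X_t, u_t)\,dW_t .

% Relaxed control: q_t is a probability measure on U, and M is an
% orthogonal martingale measure with intensity q_t(da)\,dt
dX_t = \int_U b(t, X_t, a)\, q_t(da)\,dt + \int_U \sigma(t, X_t, a)\, M(da, dt) .
```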

2021 ◽  
Author(s):  
Yiming Peng

Reinforcement Learning (RL) problems appear in diverse real-world applications and are gaining substantial attention in academia and industry. Policy Direct Search (PDS) is widely recognized as an effective approach to RL problems. However, existing PDS algorithms have some major limitations. First, many step-wise Policy Gradient Search (PGS) algorithms cannot effectively utilize informative historical gradients to accurately estimate policy gradients. Second, although evolutionary PDS algorithms do not rely on accurate policy gradient estimations and can explore learning environments effectively, they are not sample efficient at learning policies in the form of deep neural networks. Third, existing PGS algorithms often diverge easily due to the lack of reliable and flexible techniques for value function learning. Fourth, existing PGS algorithms have not provided suitable mechanisms to learn proper state features automatically.

To address these limitations, the overall goal of this thesis is to develop effective policy direct search algorithms for tackling challenging RL problems through technical innovations in four key areas. First, the thesis aims to improve the accuracy of policy gradient estimation by utilizing historical gradients through a Primal-Dual Approximation technique. Second, the thesis aims to surpass state-of-the-art performance by properly balancing the exploration-exploitation trade-off via Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and Proximal Policy Optimization (PPO). Third, the thesis seeks to stabilize value function learning via a self-organized Sandpile Model (SM) while generalizing the compatible condition to support flexible value function learning. Fourth, the thesis endeavors to develop innovative evolutionary feature learning techniques that are capable of automatically extracting useful state features so as to enhance various cutting-edge PGS algorithms.

In the thesis, we explore the four key technical areas by studying policies with increasing complexity. We begin with a simple linear policy representation and then proceed to a complex neural-network-based policy representation. Next, we consider a more complicated situation where policy learning is coupled with value function learning. Subsequently, we consider policies modeled as a concatenation of two interrelated networks, one for feature learning and one for action selection.

To achieve the first goal, this thesis proposes a new policy gradient learning framework where a series of historical gradients are jointly exploited to obtain accurate policy gradient estimations via the Primal-Dual Approximation technique. Under the framework, three new PGS algorithms for step-wise policy training have been derived from three widely used PGS algorithms; meanwhile, the convergence properties of these new algorithms have been theoretically analyzed. The empirical results on several benchmark control problems further show that the newly proposed algorithms can significantly outperform their base algorithms.

To achieve the second goal, this thesis develops a new sample-efficient evolutionary deep policy optimization algorithm based on CMA-ES and PPO. The algorithm has a layer-wise learning mechanism to improve computational efficiency in comparison to CMA-ES. Additionally, it uses a surrogate model based on a performance lower bound for fitness evaluation, significantly reducing the sample cost to the state-of-the-art level. More importantly, the best policy found by CMA-ES at every generation is further improved by PPO to properly balance exploration and exploitation. The experimental results confirm that the proposed algorithm outperforms various cutting-edge algorithms on many benchmark continuous control problems.

To achieve the third goal, this thesis develops new value function learning methods that are both reliable and flexible so as to further enhance the effectiveness of policy gradient search. Two Actor-Critic (AC) algorithms have been developed from a commonly used PGS algorithm, i.e., Regular Actor-Critic (RAC). The first algorithm adopts SM to stabilize value function learning, and the second generalizes the logarithm function used by the compatible condition to provide a flexible family of new compatible functions. The experimental results show that, with the help of reliable and flexible value function learning, the newly developed algorithms are more effective than RAC on several benchmark control problems.

To achieve the fourth goal, this thesis develops innovative NeuroEvolution algorithms for automated feature learning to enhance various cutting-edge PGS algorithms. The newly developed algorithms not only extract useful state features but also learn good policies. The experimental analysis demonstrates that the newly proposed algorithms achieve better performance on large-scale RL problems in comparison to both well-known PGS algorithms and NeuroEvolution techniques. Our experiments also confirm that the state features learned by NeuroEvolution on one RL task can be easily transferred to boost learning performance on similar but different tasks.
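As an illustration of the hybrid pattern summarized above (an evolutionary outer loop whose best candidate of each generation is refined by gradient steps), here is a minimal toy sketch. It is not the thesis algorithm: it uses a simplified isotropic evolution strategy rather than full CMA-ES, plain gradient ascent rather than PPO, and a synthetic objective (`toy_return`) standing in for an RL return; all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                      # dimension of the (linear) policy parameter vector
target = rng.normal(size=dim)

def toy_return(theta: np.ndarray) -> float:
    """Stand-in for an episodic RL return: higher is better."""
    return -float(np.sum((theta - target) ** 2))

def toy_return_grad(theta: np.ndarray) -> np.ndarray:
    """Gradient of the stand-in return (an RL method would estimate this from rollouts)."""
    return -2.0 * (theta - target)

theta_mean = np.zeros(dim)   # mean of the search distribution
sigma = 0.5                  # step size of the simplified, isotropic ES
pop_size = 16
elite_frac = 0.25
grad_steps, grad_lr = 5, 0.05

for generation in range(50):
    # 1) ES proposal: sample a population of policy parameter vectors around the mean.
    noise = rng.normal(size=(pop_size, dim))
    population = theta_mean + sigma * noise
    returns = np.array([toy_return(p) for p in population])

    # 2) Recombination: move the mean toward the elite members.
    elite_idx = np.argsort(returns)[-int(pop_size * elite_frac):]
    theta_mean = population[elite_idx].mean(axis=0)

    # 3) Gradient refinement of the best member (the role PPO plays in the abstract),
    #    here just a few plain gradient-ascent steps on the surrogate return.
    best = population[elite_idx[-1]].copy()
    for _ in range(grad_steps):
        best += grad_lr * toy_return_grad(best)
    if toy_return(best) > toy_return(theta_mean):
        theta_mean = best

print("final return:", toy_return(theta_mean))
```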


2009 ◽  
Vol 06 (07) ◽  
pp. 1221-1233 ◽  
Author(s):  
María Barbero-Liñán ◽ 
Miguel C. Muñoz-Lecanda

A geometric method is described to characterize the different kinds of extremals in optimal control theory. It is based on a presymplectic constraint algorithm that starts from the necessary conditions given by Pontryagin's Maximum Principle. The algorithm must be run twice so as to obtain suitable sets that, once projected, must be compared. Apart from the design of this general algorithm, applicable to any optimal control problem, it is shown how to classify the set of extremals and, in particular, how to characterize strict abnormality. An example of a strict abnormal extremal for a particular control-affine system is also given.
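For reference, the necessary conditions the constraint algorithm starts from can be written in the standard form below; the notation (dynamics f, running cost L, costate λ, abnormal multiplier λ₀) is assumed for illustration and is not taken from the paper.

```latex
% Pontryagin Hamiltonian for \dot{x} = f(x,u) with running cost L(x,u):
H(x, \lambda, \lambda_0, u) = \langle \lambda, f(x,u) \rangle + \lambda_0\, L(x,u),
\qquad (\lambda, \lambda_0) \neq 0, \quad \lambda_0 \le 0.

% Along an extremal, the costate satisfies the adjoint equation and the
% optimal control maximizes H pointwise:
\dot{\lambda} = -\frac{\partial H}{\partial x}, \qquad
H\big(x(t), \lambda(t), \lambda_0, u^*(t)\big) = \max_{u} H\big(x(t), \lambda(t), \lambda_0, u\big).

% Extremals admitting only lifts with \lambda_0 = 0 are the strictly abnormal ones.
```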


Author(s):  
J. M. Murray

In this paper we consider optimal control problems with linear state constraints where the states can be discontinuous at the boundary. In fact, the state vector models the position and velocity of a particle, and the collisions with the boundary that cause the discontinuities are elastic. Necessary conditions are derived by looking at limits of approximate problems that are unconstrained.
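Concretely, elastic collisions lead to a jump condition of the following form; the notation (position x, velocity v, boundary x = 0, collision time τ) is assumed here for illustration.

```latex
% Position is continuous at a collision time \tau, while the velocity is reversed:
x(\tau^+) = x(\tau^-) = 0, \qquad v(\tau^+) = -\,v(\tau^-).
```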


2009 ◽  
Vol 9 (1) ◽  
Author(s):  
Axel Anderson

This paper characterizes the behavior of value functions in dynamic stochastic discounted programming models near fixed points of the state space. When the second derivative of the flow payoff function is bounded, the value function is proportional to a linear function plus a geometric term. A specific formula for the exponent of this geometric term is provided. This exponent is continuously decreasing in the rate of patience. If the state variable is a martingale, the second derivative of the value function is unbounded. If the state variable is instead a strict local submartingale, then the same holds for the first derivative of the value function. Thus, the proposed approximation is more accurate than a Taylor series approximation. The approximation result is used to characterize locally optimal policies in several fundamental economic problems.
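In symbols, the approximation described above has the shape below; the fixed point x̄, the constants a, b, c, and the exponent β are assumed notation for illustration (the paper supplies the explicit formula for β).

```latex
% Local shape of the value function near a fixed point \bar{x}:
V(x) \approx a + b\,(x - \bar{x}) + c\,|x - \bar{x}|^{\beta}.

% An exponent \beta < 2 makes V'' unbounded at \bar{x}, and \beta < 1 does the same
% for V', which is consistent with the unbounded-derivative cases in the abstract
% and explains why a Taylor expansion is less accurate there.
```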


1984 ◽  
Vol 93 ◽  
pp. 71-108 ◽  
Author(s):  
W. H. Fleming ◽  
M. Nisio

In this paper we are concerned with stochastic relaxed control problems of the following kind. Let X(t), t ≥ 0, denote the state of a process being controlled, Y(t), t ≥ 0, the observation process, and p(t, ·) a relaxed control, that is, a process taking values in the set of probability measures on the control region Γ. The state and observation processes are governed by stochastic differential equations driven by B and W, which are independent Brownian motions with values in R^n and R^m respectively (put m = 1 for simplicity).
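The displayed equations are missing from this copy of the abstract. A typical partially observed form consistent with the surrounding text is sketched below; the coefficients α, β, and h are assumed names for illustration, not taken from the paper.

```latex
% State equation (driven by B, with the relaxed control p(t, .) averaged over \Gamma):
dX(t) = \int_{\Gamma} \alpha\big(X(t), \gamma\big)\, p(t, d\gamma)\, dt + \beta\big(X(t)\big)\, dB(t),

% Observation equation (driven by W, m = 1 for simplicity):
dY(t) = h\big(X(t)\big)\, dt + dW(t).
```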


Author(s):  
Brian C. Fabien

This paper develops a simple continuation method for the approximate solution of optimal control problems. The class of optimal control problems considered includes (i) problems with bounded controls, (ii) problems with state variable inequality constraints (SVIC), and (iii) singular control problems. The method used here is based on transforming the state variable inequality constraints into equality constraints using nonnegative slack variables. The resulting equality constraints are satisfied approximately using a quadratic loss penalty function. Similarly, singular control problems are made nonsingular using a quadratic loss penalty function based on the control. The solution of the original problem is obtained by solving the transformed problem with a sequence of penalty weights that tends to zero. The penalty weight is treated as the continuation parameter. The paper shows that the transformed problem yields necessary conditions for a minimum that can be written as a boundary value problem involving index-1 differential–algebraic equations (BVP-DAE). The BVP-DAE includes the complementarity conditions associated with the inequality constraints. It is also shown that the necessary conditions for optimality of the original problem and the transformed problem differ by a term that depends linearly on the algebraic variables in the DAE. Numerical examples are presented to illustrate the efficacy of the proposed technique.
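One plausible way to write the transformation described above is sketched below; the constraint function c, slack s, cost J, and penalty parameter ε (the continuation parameter) are assumed notation for illustration, not taken from the paper.

```latex
% State inequality constraint turned into an equality via a nonnegative slack:
c\big(x(t)\big) \le 0 \quad \longrightarrow \quad c\big(x(t)\big) + s(t) = 0, \;\; s(t) \ge 0.

% Penalized cost: the equality constraint is enforced approximately by a quadratic
% loss term, and a quadratic control term removes singularity; letting the penalty
% weight \epsilon tend to zero recovers the original problem.
J_{\epsilon}(x, u, s) = J(x, u)
  + \frac{1}{2\epsilon} \int_{t_0}^{t_f} \big(c(x(t)) + s(t)\big)^{2}\, dt
  + \frac{\epsilon}{2} \int_{t_0}^{t_f} u(t)^{\top} u(t)\, dt.
```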

