Approximation and optimality necessary conditions in relaxed stochastic control problems

2006 ◽  
Vol 2006 ◽  
pp. 1-23 ◽  
Author(s):  
Seïd Bahlali ◽  
Brahim Mezerdi ◽  
Boualem Djehiche

We consider a control problem where the state variable is a solution of a stochastic differential equation (SDE) in which the control enters both the drift and the diffusion coefficient. We study the relaxed problem, for which admissible controls are measure-valued processes and the state variable is governed by an SDE driven by an orthogonal martingale measure. Under some mild conditions on the coefficients and pathwise uniqueness, we prove that every diffusion process associated with a relaxed control is a strong limit of a sequence of diffusion processes associated with strict controls. As a consequence, we show that the strict and the relaxed control problems have the same value function and that an optimal relaxed control exists. Moreover, we derive a maximum principle of Pontryagin type, extending the well-known Peng stochastic maximum principle to the class of measure-valued controls.
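For orientation, here is a minimal sketch of the two formulations the abstract compares; the coefficient names b and σ, the control set U, the Brownian motion W, and the martingale measure M are assumed notation for illustration, not taken from the paper.

```latex
% Strict control: u_t is a U-valued process
dX_t = b(t, X_t, u_t)\,dt + \sigma(t, X_t, u_t)\,dW_t .

% Relaxed control: q_t is a probability measure on U, and M is an
% orthogonal martingale measure with intensity q_t(da)\,dt
dX_t = \int_U b(t, X_t, a)\, q_t(da)\,dt + \int_U \sigma(t, X_t, a)\, M(da, dt) .
```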

2021 ◽  
Author(s):  
Yiming Peng

Reinforcement Learning (RL) problems appear in diverse real-world applications and are gaining substantial attention in academia and industry. Policy Direct Search (PDS) is widely recognized as an effective approach to RL problems. However, existing PDS algorithms have some major limitations. First, many step-wise Policy Gradient Search (PGS) algorithms cannot effectively utilize informative historical gradients to accurately estimate policy gradients. Second, although evolutionary PDS algorithms do not rely on accurate policy gradient estimations and can explore learning environments effectively, they are not sample efficient at learning policies in the form of deep neural networks. Third, existing PGS algorithms often diverge easily due to the lack of reliable and flexible techniques for value function learning. Fourth, existing PGS algorithms have not provided suitable mechanisms to learn proper state features automatically.

To address these limitations, the overall goal of this thesis is to develop effective policy direct search algorithms for tackling challenging RL problems through technical innovations in four key areas. First, the thesis aims to improve the accuracy of policy gradient estimation by utilizing historical gradients through a Primal-Dual Approximation technique. Second, the thesis aims to surpass state-of-the-art performance by properly balancing the exploration-exploitation trade-off via Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and Proximal Policy Optimization (PPO). Third, the thesis seeks to stabilize value function learning via a self-organized Sandpile Model (SM) while generalizing the compatible condition to support flexible value function learning. Fourth, the thesis endeavors to develop innovative evolutionary feature learning techniques that are capable of automatically extracting useful state features so as to enhance various cutting-edge PGS algorithms.

In the thesis, we explore the four key technical areas by studying policies with increasing complexity. We begin with a simple linear policy representation and then proceed to a complex neural-network-based policy representation. Next, we consider a more complicated situation where policy learning is coupled with value function learning. Subsequently, we consider policies modeled as a concatenation of two interrelated networks, one for feature learning and one for action selection.

To achieve the first goal, this thesis proposes a new policy gradient learning framework where a series of historical gradients are jointly exploited to obtain accurate policy gradient estimations via the Primal-Dual Approximation technique. Under the framework, three new PGS algorithms for step-wise policy training have been derived from three widely used PGS algorithms; meanwhile, the convergence properties of these new algorithms have been theoretically analyzed. The empirical results on several benchmark control problems further show that the newly proposed algorithms can significantly outperform their base algorithms.

To achieve the second goal, this thesis develops a new sample-efficient evolutionary deep policy optimization algorithm based on CMA-ES and PPO. The algorithm has a layer-wise learning mechanism to improve computational efficiency in comparison to CMA-ES. Additionally, it uses a surrogate model based on a performance lower bound for fitness evaluation, significantly reducing the sample cost to the state-of-the-art level. More importantly, the best policy found by CMA-ES at every generation is further improved by PPO to properly balance exploration and exploitation. The experimental results confirm that the proposed algorithm outperforms various cutting-edge algorithms on many benchmark continuous control problems.

To achieve the third goal, this thesis develops new value function learning methods that are both reliable and flexible so as to further enhance the effectiveness of policy gradient search. Two Actor-Critic (AC) algorithms have been developed from a commonly used PGS algorithm, i.e., Regular Actor-Critic (RAC). The first algorithm adopts SM to stabilize value function learning, and the second generalizes the logarithm function used by the compatible condition to provide a flexible family of new compatible functions. The experimental results show that, with the help of reliable and flexible value function learning, the newly developed algorithms are more effective than RAC on several benchmark control problems.

To achieve the fourth goal, this thesis develops innovative NeuroEvolution algorithms for automated feature learning to enhance various cutting-edge PGS algorithms. The newly developed algorithms not only extract useful state features but also learn good policies. The experimental analysis demonstrates that the newly proposed algorithms achieve better performance on large-scale RL problems in comparison to both well-known PGS algorithms and NeuroEvolution techniques. Our experiments also confirm that the state features learned by NeuroEvolution on one RL task can be easily transferred to boost learning performance on similar but different tasks.
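As an illustration of the hybrid pattern summarized above (an evolutionary outer loop whose best candidate of each generation is refined by gradient steps), here is a minimal toy sketch. It is not the thesis algorithm: it uses a simplified isotropic evolution strategy rather than full CMA-ES, plain gradient ascent rather than PPO, and a synthetic objective (`toy_return`) standing in for an RL return; all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                      # dimension of the (linear) policy parameter vector
target = rng.normal(size=dim)

def toy_return(theta: np.ndarray) -> float:
    """Stand-in for an episodic RL return: higher is better."""
    return -float(np.sum((theta - target) ** 2))

def toy_return_grad(theta: np.ndarray) -> np.ndarray:
    """Gradient of the stand-in return (an RL method would estimate this from rollouts)."""
    return -2.0 * (theta - target)

theta_mean = np.zeros(dim)   # mean of the search distribution
sigma = 0.5                  # step size of the simplified, isotropic ES
pop_size = 16
elite_frac = 0.25
grad_steps, grad_lr = 5, 0.05

for generation in range(50):
    # 1) ES proposal: sample a population of policy parameter vectors around the mean.
    noise = rng.normal(size=(pop_size, dim))
    population = theta_mean + sigma * noise
    returns = np.array([toy_return(p) for p in population])

    # 2) Recombination: move the mean toward the elite members.
    elite_idx = np.argsort(returns)[-int(pop_size * elite_frac):]
    theta_mean = population[elite_idx].mean(axis=0)

    # 3) Gradient refinement of the best member (the role PPO plays in the abstract),
    #    here just a few plain gradient-ascent steps on the surrogate return.
    best = population[elite_idx[-1]].copy()
    for _ in range(grad_steps):
        best += grad_lr * toy_return_grad(best)
    if toy_return(best) > toy_return(theta_mean):
        theta_mean = best

print("final return:", toy_return(theta_mean))
```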


2009 ◽  
Vol 06 (07) ◽  
pp. 1221-1233 ◽  
Author(s):  
María Barbero-Liñán ◽ 
Miguel C. Muñoz-Lecanda

A geometric method is described to characterize the different kinds of extremals in optimal control theory. It is based on a presymplectic constraint algorithm that starts from the necessary conditions given by Pontryagin's Maximum Principle. The algorithm must be run twice so as to obtain suitable sets that, once projected, must be compared. Apart from the design of this general algorithm, applicable to any optimal control problem, it is shown how to classify the set of extremals and, in particular, how to characterize strict abnormality. An example of a strict abnormal extremal for a particular control-affine system is also given.
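For reference, the necessary conditions the constraint algorithm starts from can be written in the standard form below; the notation (dynamics f, running cost L, costate λ, abnormal multiplier λ₀) is assumed for illustration and is not taken from the paper.

```latex
% Pontryagin Hamiltonian for \dot{x} = f(x,u) with running cost L(x,u):
H(x, \lambda, \lambda_0, u) = \langle \lambda, f(x,u) \rangle + \lambda_0\, L(x,u),
\qquad (\lambda, \lambda_0) \neq 0, \quad \lambda_0 \le 0.

% Along an extremal, the costate satisfies the adjoint equation and the
% optimal control maximizes H pointwise:
\dot{\lambda} = -\frac{\partial H}{\partial x}, \qquad
H\big(x(t), \lambda(t), \lambda_0, u^*(t)\big) = \max_{u} H\big(x(t), \lambda(t), \lambda_0, u\big).

% Extremals admitting only lifts with \lambda_0 = 0 are the strictly abnormal ones.
```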


Author(s):  
J. M. Murray

In this paper we consider optimal control problems with linear state constraints where the states can be discontinuous at the boundary. In fact, the state vector models the position and velocity of a particle, and the collisions with the boundary that cause the discontinuities are elastic. Necessary conditions are derived by looking at limits of approximate problems that are unconstrained.
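Concretely, elastic collisions lead to a jump condition of the following form; the notation (position x, velocity v, boundary x = 0, collision time τ) is assumed here for illustration.

```latex
% Position is continuous at a collision time \tau, while the velocity is reversed:
x(\tau^+) = x(\tau^-) = 0, \qquad v(\tau^+) = -\,v(\tau^-).
```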


2009 ◽  
Vol 9 (1) ◽  
Author(s):  
Axel Anderson

This paper characterizes the behavior of value functions in dynamic stochastic discounted programming models near fixed points of the state space. When the second derivative of the flow payoff function is bounded, the value function is proportional to a linear function plus a geometric term. A specific formula for the exponent of this geometric term is provided. This exponent is continuously decreasing in the rate of patience. If the state variable is a martingale, the second derivative of the value function is unbounded. If the state variable is instead a strict local submartingale, then the same holds for the first derivative of the value function. Thus, the proposed approximation is more accurate than a Taylor series approximation. The approximation result is used to characterize locally optimal policies in several fundamental economic problems.
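In symbols, the approximation described above has the shape below; the fixed point x̄, the constants a, b, c, and the exponent β are assumed notation for illustration (the paper supplies the explicit formula for β).

```latex
% Local shape of the value function near a fixed point \bar{x}:
V(x) \approx a + b\,(x - \bar{x}) + c\,|x - \bar{x}|^{\beta}.

% An exponent \beta < 2 makes V'' unbounded at \bar{x}, and \beta < 1 does the same
% for V', which is consistent with the unbounded-derivative cases in the abstract
% and explains why a Taylor expansion is less accurate there.
```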


1984 ◽  
Vol 93 ◽  
pp. 71-108 ◽  
Author(s):  
W. H. Fleming ◽  
M. Nisio

In this paper we are concerned with stochastic relaxed control problems of the following kind. Let X(t), t ≥ 0, denote the state of a process being controlled, Y(t), t ≥ 0, the observation process, and p(t, ·) a relaxed control, that is, a process taking values in the set of probability measures on the control region Γ. The state and observation processes are governed by stochastic differential equations driven by B and W, which are independent Brownian motions with values in R^n and R^m respectively (put m = 1 for simplicity).
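The displayed equations are missing from this copy of the abstract. A typical partially observed form consistent with the surrounding text is sketched below; the coefficients α, β, and h are assumed names for illustration, not taken from the paper.

```latex
% State equation (driven by B, with the relaxed control p(t, .) averaged over \Gamma):
dX(t) = \int_{\Gamma} \alpha\big(X(t), \gamma\big)\, p(t, d\gamma)\, dt + \beta\big(X(t)\big)\, dB(t),

% Observation equation (driven by W, m = 1 for simplicity):
dY(t) = h\big(X(t)\big)\, dt + dW(t).
```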


Author(s):  
Brian C. Fabien

This paper develops a simple continuation method for the approximate solution of optimal control problems. The class of optimal control problems considered includes (i) problems with bounded controls, (ii) problems with state variable inequality constraints (SVIC), and (iii) singular control problems. The method used here is based on transforming the state variable inequality constraints into equality constraints using nonnegative slack variables. The resulting equality constraints are satisfied approximately using a quadratic loss penalty function. Similarly, singular control problems are made nonsingular using a quadratic loss penalty function based on the control. The solution of the original problem is obtained by solving the transformed problem with a sequence of penalty weights that tends to zero. The penalty weight is treated as the continuation parameter. The paper shows that the transformed problem yields necessary conditions for a minimum that can be written as a boundary value problem involving index-1 differential–algebraic equations (BVP-DAE). The BVP-DAE includes the complementarity conditions associated with the inequality constraints. It is also shown that the necessary conditions for optimality of the original problem and the transformed problem differ by a term that depends linearly on the algebraic variables in the DAE. Numerical examples are presented to illustrate the efficacy of the proposed technique.
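One plausible way to write the transformation described above is sketched below; the constraint function c, slack s, cost J, and penalty parameter ε (the continuation parameter) are assumed notation for illustration, not taken from the paper.

```latex
% State inequality constraint turned into an equality via a nonnegative slack:
c\big(x(t)\big) \le 0 \quad \longrightarrow \quad c\big(x(t)\big) + s(t) = 0, \;\; s(t) \ge 0.

% Penalized cost: the equality constraint is enforced approximately by a quadratic
% loss term, and a quadratic control term removes singularity; letting the penalty
% weight \epsilon tend to zero recovers the original problem.
J_{\epsilon}(x, u, s) = J(x, u)
  + \frac{1}{2\epsilon} \int_{t_0}^{t_f} \big(c(x(t)) + s(t)\big)^{2}\, dt
  + \frac{\epsilon}{2} \int_{t_0}^{t_f} u(t)^{\top} u(t)\, dt.
```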

