Infinite-Horizon Policy-Gradient Estimation

2001 ◽  
Vol 15 ◽  
pp. 319-350 ◽  
Author(s):  
J. Baxter ◽  
P. L. Bartlett

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura et al. (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter beta (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter beta is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter et al., this volume) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
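The estimator admits a compact sketch: a beta-discounted eligibility trace of the score function is multiplied by the observed rewards and averaged along a single simulated trajectory, so only two parameter-sized vectors need to be stored. The Python sketch below follows this update rule under an assumed environment and policy interface (env.reset, env.step, sample_action, grad_log_policy are illustrative names, not the authors' code).

import numpy as np

def gpomdp_gradient(env, sample_action, grad_log_policy, theta, beta, T, rng):
    # Single-trajectory, biased estimate of the gradient of the average reward.
    z = np.zeros_like(theta)      # eligibility trace, discounted by beta
    delta = np.zeros_like(theta)  # running gradient estimate
    obs = env.reset()
    for t in range(T):
        action = sample_action(theta, obs, rng)
        # gradient of log pi(action | obs; theta): only observations are used,
        # no knowledge of the underlying state is required
        z = beta * z + grad_log_policy(theta, obs, action)
        obs, reward = env.step(action)
        delta += (reward * z - delta) / (t + 1)
    return delta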

2001 ◽  
Vol 15 ◽  
pp. 351-381 ◽  
Author(s):  
J. Baxter ◽  
P. L. Bartlett ◽  
L. Weaver

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter beta, which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm and with an algorithm based on conjugate gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (this volume) on a toy problem and practical aspects of the algorithms on a number of more realistic problems.
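As a rough illustration of how gradient information can bracket maxima in a line search, the sketch below grows the step size along a search direction until the estimated directional derivative changes sign; estimate_gradient stands in for a GPOMDP-style estimator, and the names and step-size rules are illustrative rather than the paper's exact line-search procedure.

import numpy as np

def bracket_maximum(theta, direction, estimate_gradient,
                    step0=0.1, grow=2.0, max_steps=20):
    # Grow the step until the estimated directional derivative changes sign,
    # so that a maximum along `direction` is bracketed by (s_lo, s_hi).
    s_lo, s_hi = 0.0, step0
    g_lo = float(np.dot(estimate_gradient(theta), direction))
    for _ in range(max_steps):
        g_hi = float(np.dot(estimate_gradient(theta + s_hi * direction), direction))
        if g_lo > 0.0 and g_hi < 0.0:   # sign change: a maximum lies inside
            return s_lo, s_hi
        s_lo, g_lo = s_hi, g_hi
        s_hi *= grow
    return s_lo, s_hi                    # no sign change found within max_steps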


2021 ◽  
Vol 36 ◽  
Author(s):  
Arushi Jain ◽  
Khimya Khetarpal ◽  
Doina Precup

Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent's decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions which allow an agent to use sub-policies with start and end conditions. We consider a behaviour safe if it avoids regions of the state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty in the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, a continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.
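One generic way to realize this kind of trade-off is to fold a per-state uncertainty penalty into the return before applying a standard policy-gradient update. The sketch below is an illustrative assumption of that shaping, not the paper's exact constrained objective; state_uncertainty and psi are hypothetical names for some measure of outcome uncertainty and the trade-off coefficient.

import numpy as np

def penalized_returns(rewards, state_uncertainty, psi, gamma=0.99):
    # Shape the reward with an uncertainty penalty, then compute discounted
    # returns-to-go; maximizing these trades expected return against visits
    # to high-uncertainty states.
    shaped = np.asarray(rewards, dtype=float) - psi * np.asarray(state_uncertainty, dtype=float)
    G = np.zeros_like(shaped)
    running = 0.0
    for t in reversed(range(len(shaped))):
        running = shaped[t] + gamma * running
        G[t] = running
    return G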


2020 ◽  
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient (DDPG) algorithm have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
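At its simplest, combining the two estimators can be pictured as a convex mixture of the model-based and model-free gradients. The snippet below is a minimal sketch of that idea under assumed inputs (dvg_gradient from back-propagating through a learned model, dpg_gradient from the model-free critic); it is not the paper's exact update rule.

import numpy as np

def combined_policy_gradient(dvg_gradient, dpg_gradient, weight):
    # Mix the model-based deterministic value gradient with the model-free
    # deterministic policy gradient; `weight` trades model bias against the
    # variance of the model-free estimate.
    return weight * np.asarray(dvg_gradient) + (1.0 - weight) * np.asarray(dpg_gradient)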


Author(s):  
Shihui Li ◽  
Yi Wu ◽  
Xinyue Cui ◽  
Honghua Dong ◽  
Fei Fang ◽  
...  

Despite the recent advances of deep reinforcement learning (DRL), agents trained by DRL tend to be brittle and sensitive to the training environment, especially in multi-agent scenarios. In the multi-agent setting, a DRL agent's policy can easily get stuck in a poor local optimum with respect to its training partners: the learned policy may be only locally optimal to other agents' current policies. In this paper, we focus on the problem of training robust DRL agents with continuous actions in the multi-agent learning setting, so that the trained agents still generalize when their opponents' policies change. To tackle this problem, we propose a new algorithm, MiniMax Multi-agent Deep Deterministic Policy Gradient (M3DDPG), with the following contributions: (1) we introduce a minimax extension of the popular multi-agent deep deterministic policy gradient algorithm (MADDPG) for robust policy learning; (2) since the continuous action space leads to computational intractability in our minimax learning objective, we propose Multi-Agent Adversarial Learning (MAAL) to efficiently solve our proposed formulation. We empirically evaluate our M3DDPG algorithm in four mixed cooperative and competitive multi-agent environments, and the agents trained by our method significantly outperform existing baselines.
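A common way to approximate an intractable inner minimization over continuous opponent actions is to take a single small gradient step on those actions in the direction that lowers the agent's critic value, and train against the perturbed actions. The sketch below illustrates that adversarial-perturbation idea under assumed inputs (q_grad_wrt_actions, opponent_actions, epsilon are illustrative names); it is a generic sketch in the spirit of the approach, not the authors' implementation.

import numpy as np

def adversarial_opponent_actions(q_grad_wrt_actions, opponent_actions, epsilon):
    # Perturb each opponent's action against the gradient of the agent's Q
    # value: a locally worst-case (minimax-style) perturbation.
    perturbed = {}
    for agent_id, action in opponent_actions.items():
        grad = np.asarray(q_grad_wrt_actions[agent_id])
        perturbed[agent_id] = np.asarray(action) - epsilon * grad / (np.linalg.norm(grad) + 1e-8)
    return perturbed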


Author(s):  
Ali Kaveh ◽  
Mohsen Kooshkbaghi

In this paper, an enhanced artificial coronary circulation system (EACCS) algorithm is applied to structural optimization with continuous design variables and frequency constraints. The standard algorithm, the artificial coronary circulation system (ACCS), is a biologically inspired, non-gradient algorithm that mimics the growth of the coronary tree of the heart's circulatory system. Designs generated by the EACCS algorithm are compared with those of other popular evolutionary optimization methods, the objective function being the total weight of the structures. Truss optimization with frequency constraints has attracted substantial attention as a means to improve the dynamic performance of structures. This kind of problem is believed to involve nonlinear and non-convex search spaces with several local optima, which also makes it suitable for examining the capabilities of new algorithms. Here, ACCS is enhanced (EACCS) and employed for size and shape optimization of truss structures, and six truss design problems are utilized for evaluating and validating EACCS. The algorithm uses a fitness-based weighted mean in the bifurcation and runner phases of the optimization process. The numerical results demonstrate the successful performance, efficiency, and robustness of the new method and its competitive performance against some other well-known meta-heuristics in structural optimization.
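For readers unfamiliar with the idea, a fitness-based weighted mean in a minimization setting simply gives better (lighter) designs a larger share in the mean position used to generate new candidates. The snippet below is a generic sketch under that assumption; the exact weighting used in EACCS is not specified here.

import numpy as np

def fitness_weighted_mean(population, objective_values):
    # `population` is an (n, d) array of candidate designs and
    # `objective_values` their structural weights (lower is better).
    f = np.asarray(objective_values, dtype=float)
    w = (f.max() - f) + 1e-12   # best design gets the largest weight
    w /= w.sum()
    return (np.asarray(population, dtype=float) * w[:, None]).sum(axis=0)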


2005 ◽  
Vol 2005 (1) ◽  
pp. 57-80 ◽  
Author(s):  
Irwin E. Schochetman ◽  
Robert L. Smith

We consider the problem of selecting an optimality criterion, when total costs diverge, in deterministic infinite horizon optimization over discrete time. Our formulation allows for both discrete and continuous state and action spaces, as well as time-varying, that is, nonstationary, data. The task is to choose a criterion that is neither too overselective, so that no policy is optimal, nor too underselective, so that most policies are optimal. We contrast and compare the following optimality criteria: strong, overtaking, weakly overtaking, efficient, and average. However, our focus is on the optimality criterion of efficiency. (A solution is efficient if it is optimal to each of the states through which it passes.) Under mild regularity conditions, we show that efficient solutions always exist and thus are not overselective. As to underselectivity, we provide weak state reachability conditions which assure that every efficient solution is also average optimal, thus providing a sufficient condition for average optima to exist. Our main result concerns the case where the discounted per-period costs converge to zero, while the discounted total costs diverge to infinity. Under the assumption that we can reach from any feasible state any feasible sequence of states in bounded time, we show that every efficient solution is also overtaking, thus providing a sufficient condition for overtaking optima to exist.
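For orientation, the overtaking, weakly overtaking, and average criteria compared above are commonly stated as follows for a cost-minimization problem with N-horizon total cost C_N(pi); exact conventions (limsup versus liminf) vary across the literature, so these forms are indicative rather than the paper's precise definitions.

\begin{align*}
\text{overtaking optimal:}\quad
  & \limsup_{N\to\infty}\bigl[C_N(\pi^\ast)-C_N(\pi)\bigr]\le 0
    \quad\text{for every feasible }\pi,\\
\text{weakly overtaking optimal:}\quad
  & \liminf_{N\to\infty}\bigl[C_N(\pi^\ast)-C_N(\pi)\bigr]\le 0
    \quad\text{for every feasible }\pi,\\
\text{average optimal:}\quad
  & \limsup_{N\to\infty}\tfrac{1}{N}\,C_N(\pi^\ast)
    \le \limsup_{N\to\infty}\tfrac{1}{N}\,C_N(\pi)
    \quad\text{for every feasible }\pi.
\end{align*}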


2003 ◽  
Vol 17 (2) ◽  
pp. 251-265 ◽  
Author(s):  
I.J.B.F. Adan ◽  
J.A.C. Resing ◽  
V.G. Kulkarni

Stochastic discretization is a technique of representing a continuous random variable as a random sum of i.i.d. exponential random variables. In this article, we apply this technique to study the limiting behavior of a stochastic fluid model. Specifically, we consider an infinite-capacity fluid buffer, where the net input of fluid is regulated by a finite-state irreducible continuous-time Markov chain. Most long-run performance characteristics for such a fluid system can be expressed as the long-run average reward for a suitably chosen reward structure. In this article, we use stochastic discretization of the fluid content process to efficiently determine the long-run average reward. This method transforms the continuous-state Markov process describing the fluid model into a discrete-state quasi-birth–death process. Hence, standard tools, such as the matrix-geometric approach, become available for the analysis of the fluid buffer. To demonstrate this approach, we analyze the output of a buffer processing fluid from K sources on a first-come first-served basis.
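A concrete special case of such a representation is the classical fact that a geometric sum of i.i.d. exponential random variables is again exponential: if N is geometric with parameter p and each phase is exponential with rate lambda, the sum is exponential with rate lambda*p. The simulation below checks this numerically; it illustrates the flavour of stochastic discretization only, since the construction used for general fluid levels is more involved.

import numpy as np

def geometric_sum_of_exponentials(lam, p, size, rng):
    # Draw N ~ Geometric(p) on {1, 2, ...}, then sum N i.i.d. Exp(lam) phases.
    n = rng.geometric(p, size=size)
    return np.array([rng.exponential(1.0 / lam, k).sum() for k in n])

rng = np.random.default_rng(0)
samples = geometric_sum_of_exponentials(lam=2.0, p=0.25, size=100_000, rng=rng)
print(samples.mean())                    # close to 1 / (lam * p) = 2.0
print(np.median(samples), np.log(2) / (2.0 * 0.25))   # median of Exp(lam * p)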


Author(s):  
Maximiano V. Ramos ◽  
Armstrong Frederick ◽  
Ahmed M. Al-Jumaily

Polymer nanocomposites offer various functional advantages required for several biomedical applications. For example, polymer nanocomposites are biocompatible, biodegradable, and can be engineered to have mechanical properties suitable for specific applications. The key to the use of polymer nanocomposites for different applications is the correct choice of matrix polymer chemistry, filler type, and matrix-filler interaction. This paper discusses the results of a study on the processing and characterization of nano-filled polymer composites and focuses on the improvement of their properties for potential biomedical applications. The experimental procedure for the preparation of nano-filled polymer composites by ultrasonic mixing is described. Different types of nanofillers and polymer matrices are studied. The effects of processing parameters, such as filler loading and mixing time, on the mechanical properties of the composites are discussed. Preliminary results indicate that improvements in shear, flexural, tensile, and compressive properties were observed in the prepared composites for some processing conditions.

