Experiments with Infinite-Horizon, Policy-Gradient Estimation

2001 ◽  
Vol 15 ◽  
pp. 351-381 ◽  
Author(s):  
J. Baxter ◽  
P. L. Bartlett ◽  
L. Weaver

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter, beta, which has a natural interpretation in terms of a bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control, and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm and with a conjugate-gradient algorithm that uses gradient information to bracket maxima in line searches. Experimental results illustrate both the theoretical results of Baxter and Bartlett (this volume) on a toy problem and practical aspects of the algorithms on a number of more realistic problems.
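
As a rough illustration of the bracketing idea mentioned above, the sketch below grows a step along an ascent direction until the directional derivative changes sign, which brackets a maximum for a subsequent line search. The toy objective, its analytic gradient, and all step sizes are illustrative assumptions; in the paper's setting the gradient would instead be estimated by GPOMDP.

```python
import numpy as np

# Toy concave objective with a known analytic gradient; it stands in for the
# average reward, whose gradient would in practice be estimated by GPOMDP.
def f(theta):
    return -np.sum((theta - 1.0) ** 2)

def grad(theta):
    return -2.0 * (theta - 1.0)

def bracket_maximum(theta, d, step0=0.1, grow=2.0, max_iter=50):
    """Grow the step along direction d until the directional derivative turns
    negative; the maximum then lies inside the returned interval."""
    lo, step = 0.0, step0
    for _ in range(max_iter):
        if grad(theta + step * d) @ d < 0.0:   # slope flipped: maximum bracketed
            return lo, step
        lo, step = step, step * grow
    return lo, step

theta = np.zeros(3)
d = grad(theta)                    # ascent direction (steepest ascent here)
a, b = bracket_maximum(theta, d)
# A line search (e.g. bisection on the directional derivative) would refine
# the step inside [a, b]; here we simply take the midpoint.
theta = theta + 0.5 * (a + b) * d
print(f(theta), theta)
```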

2001 ◽  
Vol 15 ◽  
pp. 319-350 ◽  
Author(s):  
J. Baxter ◽  
P. L. Bartlett

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in partially observable Markov decision processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura et al. (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter, beta (which has a natural interpretation in terms of a bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the correct choice of the parameter beta is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains; continuous state, observation, and control spaces; multiple agents; higher-order derivatives; and a version for training stochastic policies with internal states. In a companion paper (Baxter et al., this volume) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic-gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
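
The estimator described above can be sketched in its eligibility-trace form, with a single discount parameter beta and storage of only two parameter-sized vectors, as the abstract notes. The toy partially observed environment, the logistic policy, and all constants below are assumptions made for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy partially observed chain: the hidden state flips with probability 0.1,
# the agent sees only a noisy observation, and rewards favour matching the
# action to the hidden state. (Illustrative stand-in environment.)
def step(state, action):
    reward = 1.0 if action == state else 0.0
    state = state if rng.random() > 0.1 else 1 - state
    obs = state if rng.random() > 0.2 else 1 - state   # noisy observation
    return state, obs, reward

def policy_probs(theta, obs):
    p1 = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * obs)))  # P(action = 1 | obs)
    return np.array([1.0 - p1, p1])

def grad_log_policy(theta, obs, action):
    p1 = policy_probs(theta, obs)[1]
    return (action - p1) * np.array([1.0, obs])   # d/dtheta log mu(action | obs)

def gpomdp_estimate(theta, beta=0.9, T=20000):
    """Eligibility-trace gradient estimate; only z and Delta are stored,
    i.e. twice the number of policy parameters."""
    z = np.zeros_like(theta)       # eligibility trace, discounted by beta
    Delta = np.zeros_like(theta)   # running average of reward * trace
    state, obs = 0, 0
    for t in range(T):
        action = rng.choice(2, p=policy_probs(theta, obs))
        z = beta * z + grad_log_policy(theta, obs, action)
        state, obs, reward = step(state, action)
        Delta += (reward * z - Delta) / (t + 1)
    return Delta

print(gpomdp_estimate(np.zeros(2)))
```

The choice of beta controls the bias-variance trade-off mentioned above: a beta closer to one reduces the bias for slowly mixing POMDPs at the cost of higher variance in the estimate.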


2020 ◽  
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient (DDPG) algorithm have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper, we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous work considers deterministic value gradients with a finite horizon, which is myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. The results demonstrate that DVPG substantially outperforms the other baselines.
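
For context, the model-free deterministic policy gradient update that DDPG uses, and that DVPG combines with model-based value gradients, can be sketched as follows. The network sizes, batch, and optimizer settings are illustrative assumptions; this is not the paper's DVPG implementation, which additionally rolls the learned model forward, a part not reproduced here.

```python
import torch
import torch.nn as nn

# Minimal deterministic policy gradient (DDPG-style) actor update on random
# data; dimensions, networks, and optimiser settings are illustrative only.
state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)          # a batch, e.g. from a replay buffer
actions = actor(states)                      # deterministic actions a = mu_theta(s)
# Deterministic policy gradient: ascend Q(s, mu_theta(s)) with respect to theta,
# i.e. minimise the negative critic value.
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```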


Author(s):  
Karel Horák ◽  
Branislav Bošanský ◽  
Krishnendu Chatterjee

Partially observable Markov decision processes (POMDPs) are the standard models for planning under uncertainty with both finite and infinite horizons. Besides the well-known discounted-sum objective, the indefinite-horizon objective (a.k.a. Goal-POMDPs) is another classical objective for POMDPs. In this case, given a set of target states and a positive cost for each transition, the optimization objective is to minimize the expected total cost until a target state is reached. In the literature, RTDP-Bel or heuristic search value iteration (HSVI) have been used for solving Goal-POMDPs. Neither of these algorithms has theoretical convergence guarantees, and HSVI may even fail to terminate its trials. We make the following contributions: (1) we discuss the challenges introduced by Goal-POMDPs and illustrate how they prevent the original HSVI from converging; (2) we present a novel algorithm inspired by HSVI, termed Goal-HSVI, and show that our algorithm has convergence guarantees; and (3) we show that Goal-HSVI outperforms RTDP-Bel on a set of well-known examples.
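
The indefinite-horizon objective itself, the expected total cost until a target state is reached, can be illustrated with undiscounted value iteration on a tiny fully observable chain; the belief-space bounds that Goal-HSVI maintains are not reproduced here, and the chain, costs, and action set below are made up for illustration.

```python
import numpy as np

# Expected total cost until reaching the goal on a toy chain MDP:
# states 0..4, state 4 is the goal, every transition costs 1, and each
# action advances one state to the right with the given probability.
n = 5
cost = 1.0
actions = {"fast": 0.8, "careful": 0.5}   # P(advance) under each action

V = np.zeros(n)            # expected cost-to-goal; V[goal] stays 0
for _ in range(1000):      # undiscounted value iteration
    for s in range(n - 1):
        V[s] = min(cost + p * V[s + 1] + (1 - p) * V[s]
                   for p in actions.values())
print(V)                   # converges to approximately [5, 3.75, 2.5, 1.25, 0]
```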


2011 ◽  
Vol 10 (06) ◽  
pp. 1175-1197 ◽  
Author(s):  
John Goulionis ◽  
D. Stengos

This paper treats the infinite-horizon discounted-cost control problem for partially observable Markov decision processes. Sondik studied the class of finitely transient policies and showed that their value functions over an infinite time horizon are piecewise linear (p.w.l.) and can be computed exactly by solving a system of linear equations. However, the condition of finite transience is stronger than is needed to ensure p.w.l. value functions. In this paper, we alternatively introduce the class of periodic policies, whose value functions also turn out to be p.w.l. Moreover, we examine a condition more general than finite transience and periodicity that ensures p.w.l. value functions. We implement these ideas in a replacement problem under Markovian deterioration, investigate periodic policies for it, and give numerical examples.
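
A piecewise linear value function of the kind discussed above can be represented as a finite set of alpha-vectors and evaluated at a belief by taking the minimum (in the cost setting) of the corresponding linear functions. The vectors and the two-state deterioration belief below are illustrative assumptions only.

```python
import numpy as np

# Piecewise linear value function over beliefs for a discounted-cost POMDP:
# V(b) = min over alpha-vectors of b . alpha. Vectors are illustrative and
# loosely labelled after actions in a replacement problem.
alphas = np.array([[0.0, 10.0],    # e.g. "replace now"
                   [4.0,  6.0],    # e.g. "inspect"
                   [2.0, 12.0]])   # e.g. "do nothing"

def value(belief):
    return np.min(alphas @ belief)

for p_bad in np.linspace(0.0, 1.0, 5):
    b = np.array([1.0 - p_bad, p_bad])     # belief over {good, deteriorated}
    print(f"P(deteriorated)={p_bad:.2f}  V(b)={value(b):.2f}")
```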


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6571
Author(s):  
Zhichao Jia ◽  
Qiang Gao ◽  
Xiaohong Peng

In recent years, machine learning for trading has been widely studied. Trading decisions must determine both the direction and the size of a position based on market conditions. However, no research so far considers variable position sizes in models developed for trading purposes. In this paper, we propose a deep reinforcement learning model named LSTM-DDPG to make trading decisions with variable positions. Specifically, we model the trading process as a partially observable Markov decision process, in which a long short-term memory (LSTM) network is used to extract market state features and the deep deterministic policy gradient (DDPG) framework is used to make trading decisions concerning the direction and variable size of the position. We test the LSTM-DDPG model on IF300 (index futures of the Chinese stock market) data, and the results show that LSTM-DDPG with variable positions performs better in terms of return and risk than models with fixed or few-level positions. In addition, the investment potential of the model is better exploited by a differential Sharpe ratio reward function than by a profit-based reward function.
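
A minimal sketch of a differential Sharpe ratio reward, in the spirit of Moody and Saffell's formulation, is shown below; the adaptation rate and the example returns are assumptions, and this is not the exact reward implementation used by LSTM-DDPG.

```python
# Differential Sharpe ratio used as a per-step reward: maintain exponential
# moving estimates of the first and second moments of returns and reward the
# marginal improvement of the Sharpe ratio at each step.
class DifferentialSharpe:
    def __init__(self, eta=0.01):
        self.eta = eta
        self.A = 0.0       # moving estimate of mean return
        self.B = 0.0       # moving estimate of mean squared return

    def reward(self, ret):
        dA = ret - self.A
        dB = ret ** 2 - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        d = 0.0 if denom <= 1e-12 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += self.eta * dA
        self.B += self.eta * dB
        return d

dsr = DifferentialSharpe()
for r in [0.002, -0.001, 0.003, 0.000, 0.004]:   # example per-step returns
    print(round(dsr.reward(r), 4))
```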


Author(s):  
Shuangxia Bai ◽  
Shaomei Song ◽  
Shiyang Liang ◽  
Jianmei Wang ◽  
Bo Li ◽  
...  

Aiming at intelligent decision-making for UAVs based on situation information in air combat, a novel maneuvering decision method based on deep reinforcement learning is proposed in this paper. The autonomous maneuvering model of the UAV is established as a Markov decision process. The twin delayed deep deterministic policy gradient (TD3) algorithm and the deep deterministic policy gradient (DDPG) algorithm in deep reinforcement learning are used to train the model, and the experimental results of the two algorithms are analyzed and compared. The simulation results show that, compared with the DDPG algorithm, the TD3 algorithm has stronger decision-making performance and faster convergence, and is more suitable for solving air-combat problems. The algorithm proposed in this paper enables UAVs to autonomously make maneuvering decisions based on situation information such as position, speed, and relative azimuth, adjusting their actions to approach and successfully strike the enemy, and provides a new method for intelligent maneuvering decisions of UAVs in air combat.
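
The main ingredient that distinguishes TD3 from DDPG, target policy smoothing combined with the minimum of two target critics, can be sketched as below; the network architectures, dimensions, and noise parameters are illustrative assumptions rather than the paper's UAV setup.

```python
import torch
import torch.nn as nn

# TD3 target computation: smoothed target action plus clipped double Q-learning
# (the minimum of two target critics). Networks and sizes are illustrative.
state_dim, action_dim, max_action = 12, 4, 1.0
target_actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, action_dim), nn.Tanh())
target_q1 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_q2 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def td3_target(reward, next_state, done, gamma=0.99, noise_std=0.2, noise_clip=0.5):
    with torch.no_grad():
        a = target_actor(next_state)
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (a + noise).clamp(-max_action, max_action)      # smoothed target action
        q_in = torch.cat([next_state, next_action], dim=1)
        target_q = torch.min(target_q1(q_in), target_q2(q_in))        # clipped double Q
        return reward + gamma * (1.0 - done) * target_q

batch = 32
y = td3_target(torch.randn(batch, 1), torch.randn(batch, state_dim), torch.zeros(batch, 1))
print(y.shape)   # torch.Size([32, 1])
```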


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Xiaofeng Jiang ◽  
Hongsheng Xi

The optimization problem for the performance of opportunistic spectrum access is considered in this study. A user with limited sensing capacity has opportunistic access to a communication system with multiple channels. In each time slot, the user can choose only a few channels to sense and decides whether to access them based on the sensing information; the presence of sensing errors is also considered. A reward is obtained when the user accesses a channel. The objective is to maximize the expected (discounted or average) reward accrued over an infinite horizon. This problem can be formulated as a partially observable Markov decision process. This study establishes the optimality of the simple and robust myopic policy, which focuses on maximizing the immediate reward. The results show that the myopic policy is optimal in cases of practical interest.
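
A myopic sensing policy of the kind analyzed above can be sketched as follows: maintain a belief that each channel is idle, sense the channel with the largest immediate expected reward, and update the belief from the (possibly erroneous) sensing outcome. The Gilbert-Elliott transition probabilities and the simple sensing-error model below are illustrative assumptions.

```python
import numpy as np

# Myopic policy for opportunistic spectrum access over Gilbert-Elliott channels.
p11, p01 = 0.8, 0.3          # P(idle -> idle), P(busy -> idle)
p_false_busy = 0.1           # an idle channel is sensed as busy with this probability
rng = np.random.default_rng(1)

n_channels, T = 4, 10
belief = np.full(n_channels, 0.5)              # P(channel is idle)
true_state = rng.random(n_channels) < 0.5      # True = idle

for t in range(T):
    # Predict one step ahead, then act myopically: sense the channel whose
    # predicted idle probability (the immediate expected reward) is largest.
    belief = belief * p11 + (1.0 - belief) * p01
    k = int(np.argmax(belief))

    true_state = rng.random(n_channels) < np.where(true_state, p11, p01)
    sensed_idle = bool(true_state[k]) and rng.random() > p_false_busy

    if sensed_idle:
        belief[k] = 1.0        # in this error model only idle channels are sensed idle
    else:
        b = belief[k]
        belief[k] = b * p_false_busy / (b * p_false_busy + (1.0 - b))
    print(f"slot {t}: sensed channel {k}, idle={sensed_idle}")
```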

