A Simulation-Based Policy Iteration Algorithm for Average Cost Unichain Markov Decision Processes

Author(s):  
Ying He ◽  
Michael C. Fu ◽  
Steven I. Marcus


2003 ◽
Vol 17 (2) ◽  
pp. 213-234 ◽  
Author(s):  
William L. Cooper ◽  
Shane G. Henderson ◽  
Mark E. Lewis

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations exactly, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions ensuring that, almost surely, simulation-based policy iteration eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.
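As context for the abstract above, the following is a minimal sketch of one SBPI iteration for a finite average-reward MDP, written in Python with NumPy. The bias estimator used here (centered rewards accumulated until return to a reference state) is only one possible simulation estimator for the average evaluation equations and is not necessarily one of the three analyzed in the paper; the arrays P and r, the reference state, and the run length are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of one simulation-based policy iteration (SBPI) step.
# P[s, a, s'] are transition probabilities, r[s, a] one-step rewards,
# policy[s] the current decision rule.  All names are hypothetical.

def estimate_gain_and_bias(P, r, policy, run_length, rng):
    """Simulate the chain under `policy`; return estimates (g_hat, h_hat)."""
    n_states = P.shape[0]
    s = 0                                   # reference/start state (assumption)
    states, rewards = [], []
    for _ in range(run_length):
        a = policy[s]
        states.append(s)
        rewards.append(r[s, a])
        s = rng.choice(n_states, p=P[s, a])
    states, rewards = np.array(states), np.array(rewards)
    g_hat = rewards.mean()                  # gain estimate: long-run average reward

    # Bias relative to state 0: within each regeneration cycle (between
    # successive visits to 0), accumulate centered rewards from the first
    # visit of each state to the end of the cycle, then average over cycles.
    h_sums = np.zeros(n_states)
    h_counts = np.zeros(n_states)
    zero_times = np.flatnonzero(states == 0)
    for start, end in zip(zero_times[:-1], zero_times[1:]):
        cycle_states = states[start:end]
        centered = rewards[start:end] - g_hat
        tail = np.cumsum(centered[::-1])[::-1]   # tail[k] = centered reward from k to cycle end
        for k, st in enumerate(cycle_states):
            if st not in cycle_states[:k]:       # first visit of st in this cycle
                h_sums[st] += tail[k]
                h_counts[st] += 1
    h_hat = np.divide(h_sums, np.maximum(h_counts, 1))
    return g_hat, h_hat

def improve_policy(P, r, h_hat):
    """Greedy improvement: argmax_a of r(s,a) + sum_j P(j|s,a) h_hat(j)."""
    q = r + P @ h_hat
    return q.argmax(axis=1)

# One SBPI iteration on a toy 3-state, 2-action MDP.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # random transition kernel, shape (3, 2, 3)
r = rng.uniform(size=(3, 2))
policy = np.zeros(3, dtype=int)
g_hat, h_hat = estimate_gain_and_bias(P, r, policy, run_length=5000, rng=rng)
policy = improve_policy(P, r, h_hat)
print(g_hat, policy)
```

In the paper's setting, the key question is how the simulation run length must grow across iterations so that estimation errors in h_hat do not prevent the improvement step from eventually settling on optimal decision rules.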


1994 ◽  
Vol 31 (4) ◽
pp. 979-990 ◽
Author(s):  
Jean B. Lasserre

We present two sufficient conditions for detecting optimal and non-optimal actions in (ergodic) average-cost MDPs. They are easily interpreted and can be implemented as detection tests in both policy iteration and linear programming methods. An efficient implementation of a recently proposed policy iteration scheme is also discussed.
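For background only (this is the standard setting, not the paper's specific tests): detection tests of this kind are phrased around the average-cost optimality equation. A sketch of that equation, with assumed notation, is

```latex
% Standard average-cost optimality equation (background; the paper's two
% detection conditions are not reproduced here).  c(i,a) are one-step costs,
% p_{ij}(a) transition probabilities, g^* the optimal average cost, and h a
% relative cost (bias) function.
\[
  g^{*} + h(i) \;=\; \min_{a \in A(i)} \Bigl\{ c(i,a) + \sum_{j \in S} p_{ij}(a)\, h(j) \Bigr\},
  \qquad i \in S .
\]
```

Tests of this general kind can be evaluated within each policy-evaluation step of policy iteration, or against feasible solutions of the corresponding linear program.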


2009 ◽  
Vol 2009 ◽  
pp. 1-17 ◽  
Author(s):  
Quanxin Zhu ◽  
Xinsong Yang ◽  
Chuangxia Huang

We study the policy iteration algorithm (PIA) for continuous-time jump Markov decision processes in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. The criterion that we are concerned with is expected average reward. We propose a set of conditions under which we first establish the average reward optimality equation and present the PIA. Then, under two slightly different sets of conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation.
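For reference, a commonly used form of the average reward optimality equation for continuous-time jump MDPs is sketched below; the notation (transition rates q, reward rate r, bias h) is assumed here and need not match the paper's.

```latex
% Average reward optimality equation for a continuous-time jump MDP
% (standard form; notation assumed): q(dy|x,a) denotes the possibly
% unbounded transition rates, r(x,a) the reward rate, g the optimal
% average reward, and h a bias (relative value) function.
\[
  g \;=\; \sup_{a \in A(x)} \Bigl\{ r(x,a) + \int_{S} h(y)\, q(dy \mid x, a) \Bigr\},
  \qquad x \in S .
\]
```

In broad terms, the PIA alternates between solving the evaluation (Poisson) equation for the current stationary policy and improving the policy by taking the supremum on the right-hand side.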

