A Simulation-Based Policy Iteration Algorithm for Average Cost Unichain Markov Decision Processes

Author(s):  
Ying He ◽  
Michael C. Fu ◽  
Steven I. Marcus


2003 ◽
Vol 17 (2) ◽  
pp. 213-234 ◽  
Author(s):  
William L. Cooper ◽  
Shane G. Henderson ◽  
Mark E. Lewis

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations exactly, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions ensuring that, almost surely, simulation-based policy iteration eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.
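As context for the abstract above, the following is a minimal sketch of one SBPI iteration for a finite average-reward MDP, written in Python with NumPy. The bias estimator used here (centered rewards accumulated until return to a reference state) is only one possible simulation estimator for the average evaluation equations and is not necessarily one of the three analyzed in the paper; the arrays P and r, the reference state, and the run length are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of one simulation-based policy iteration (SBPI) step.
# P[s, a, s'] are transition probabilities, r[s, a] one-step rewards,
# policy[s] the current decision rule.  All names are hypothetical.

def estimate_gain_and_bias(P, r, policy, run_length, rng):
    """Simulate the chain under `policy`; return estimates (g_hat, h_hat)."""
    n_states = P.shape[0]
    s = 0                                   # reference/start state (assumption)
    states, rewards = [], []
    for _ in range(run_length):
        a = policy[s]
        states.append(s)
        rewards.append(r[s, a])
        s = rng.choice(n_states, p=P[s, a])
    states, rewards = np.array(states), np.array(rewards)
    g_hat = rewards.mean()                  # gain estimate: long-run average reward

    # Bias relative to state 0: within each regeneration cycle (between
    # successive visits to 0), accumulate centered rewards from the first
    # visit of each state to the end of the cycle, then average over cycles.
    h_sums = np.zeros(n_states)
    h_counts = np.zeros(n_states)
    zero_times = np.flatnonzero(states == 0)
    for start, end in zip(zero_times[:-1], zero_times[1:]):
        cycle_states = states[start:end]
        centered = rewards[start:end] - g_hat
        tail = np.cumsum(centered[::-1])[::-1]   # tail[k] = centered reward from k to cycle end
        for k, st in enumerate(cycle_states):
            if st not in cycle_states[:k]:       # first visit of st in this cycle
                h_sums[st] += tail[k]
                h_counts[st] += 1
    h_hat = np.divide(h_sums, np.maximum(h_counts, 1))
    return g_hat, h_hat

def improve_policy(P, r, h_hat):
    """Greedy improvement: argmax_a of r(s,a) + sum_j P(j|s,a) h_hat(j)."""
    q = r + P @ h_hat
    return q.argmax(axis=1)

# One SBPI iteration on a toy 3-state, 2-action MDP.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # random transition kernel, shape (3, 2, 3)
r = rng.uniform(size=(3, 2))
policy = np.zeros(3, dtype=int)
g_hat, h_hat = estimate_gain_and_bias(P, r, policy, run_length=5000, rng=rng)
policy = improve_policy(P, r, h_hat)
print(g_hat, policy)
```

In the paper's setting, the key question is how the simulation run length must grow across iterations so that estimation errors in h_hat do not prevent the improvement step from eventually settling on optimal decision rules.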


1994 ◽  
Vol 31 (4) ◽
pp. 979-990 ◽
Author(s):  
Jean B. Lasserre

We present two sufficient conditions for detecting optimal and non-optimal actions in (ergodic) average-cost MDPs. They are easily interpreted and can be implemented as detection tests in both policy iteration and linear programming methods. An efficient implementation of a recently proposed policy iteration scheme is also discussed.
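For background only (this is the standard setting, not the paper's specific tests): detection tests of this kind are phrased around the average-cost optimality equation. A sketch of that equation, with assumed notation, is

```latex
% Standard average-cost optimality equation (background; the paper's two
% detection conditions are not reproduced here).  c(i,a) are one-step costs,
% p_{ij}(a) transition probabilities, g^* the optimal average cost, and h a
% relative cost (bias) function.
\[
  g^{*} + h(i) \;=\; \min_{a \in A(i)} \Bigl\{ c(i,a) + \sum_{j \in S} p_{ij}(a)\, h(j) \Bigr\},
  \qquad i \in S .
\]
```

Tests of this general kind can be evaluated within each policy-evaluation step of policy iteration, or against feasible solutions of the corresponding linear program.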


2009 ◽  
Vol 2009 ◽  
pp. 1-17 ◽  
Author(s):  
Quanxin Zhu ◽  
Xinsong Yang ◽  
Chuangxia Huang

We study the policy iteration algorithm (PIA) for continuous-time jump Markov decision processes in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. The criterion that we are concerned with is expected average reward. We propose a set of conditions under which we first establish the average reward optimality equation and present the PIA. Then, under two slightly different sets of conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation.
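For reference, a commonly used form of the average reward optimality equation for continuous-time jump MDPs is sketched below; the notation (transition rates q, reward rate r, bias h) is assumed here and need not match the paper's.

```latex
% Average reward optimality equation for a continuous-time jump MDP
% (standard form; notation assumed): q(dy|x,a) denotes the possibly
% unbounded transition rates, r(x,a) the reward rate, g the optimal
% average reward, and h a bias (relative value) function.
\[
  g \;=\; \sup_{a \in A(x)} \Bigl\{ r(x,a) + \int_{S} h(y)\, q(dy \mid x, a) \Bigr\},
  \qquad x \in S .
\]
```

In broad terms, the PIA alternates between solving the evaluation (Poisson) equation for the current stationary policy and improving the policy by taking the supremum on the right-hand side.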

