Simulation-Based Algorithms for Markov Decision Processes: Monte Carlo Tree Search from AlphaGo to AlphaZero

2019 ◽ Vol 36 (06) ◽ pp. 1940009 ◽ Author(s): Michael C. Fu

AlphaGo and its successors AlphaGo Zero and AlphaZero made international headlines with their incredible successes in game playing, which have been touted as further evidence of the immense potential of artificial intelligence, and in particular, machine learning. AlphaGo defeated the reigning human world champion Go player Lee Sedol 4 games to 1 in March 2016 in Seoul, Korea, an achievement that surpassed previous computer game-playing milestones set by IBM's Deep Blue in chess and by IBM's Watson on the U.S. TV game show Jeopardy! AlphaGo then followed this up by defeating the world's number one Go player Ke Jie 3-0 at the Future of Go Summit in Wuzhen, China in May 2017. Then, in December 2017, AlphaZero stunned the chess world by dominating the top computer chess program Stockfish (which has a far higher rating than any human) in a 100-game match, winning 28 games and losing none (72 draws) after training from scratch for just four hours! The deep neural networks of AlphaGo, AlphaZero, and all their incarnations are trained using a technique called Monte Carlo tree search (MCTS), whose roots can be traced back to an adaptive multistage sampling (AMS) simulation-based algorithm for Markov decision processes (MDPs) published in Operations Research in 2005 [Chang, HS, MC Fu, J Hu and SI Marcus (2005). An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53, 126–139] (and introduced even earlier, in 2002). After a review of the history and background of AlphaGo through AlphaZero, the origins of MCTS are traced back to simulation-based algorithms for MDPs, and its role in training the neural networks that essentially carry out the value/policy function approximation used in approximate dynamic programming, reinforcement learning, and neuro-dynamic programming is discussed, including some recently proposed enhancements that build on statistical ranking & selection research in the operations research simulation community.
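
The AMS–MCTS connection described above comes down to a UCB1-style rule that adaptively allocates simulations among the actions at a state. The sketch below is a minimal illustration of that allocation idea under invented names and a toy reward model; it is not the algorithm as specified in the 2005 paper.

    import math
    import random

    def ucb_select(counts, means, c=math.sqrt(2)):
        """Pick the action maximizing empirical mean plus a UCB1 exploration bonus."""
        total = sum(counts)
        for a, n in enumerate(counts):
            if n == 0:
                return a  # try every action once before applying the bonus formula
        return max(range(len(counts)),
                   key=lambda a: means[a] + c * math.sqrt(math.log(total) / counts[a]))

    def estimate_value(simulate_action, num_actions, budget=1000):
        """Adaptively spend `budget` simulations and return the estimated optimal value."""
        counts = [0] * num_actions
        means = [0.0] * num_actions
        for _ in range(budget):
            a = ucb_select(counts, means)
            r = simulate_action(a)                   # one stochastic rollout of action a
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]   # incremental mean update
        return max(means)

    # Toy usage: two actions with Bernoulli rewards 0.4 and 0.6; the estimate approaches 0.6.
    random.seed(0)
    print(estimate_value(lambda a: float(random.random() < (0.4, 0.6)[a]), num_actions=2))

In MCTS this selection rule is applied recursively down the search tree; AMS applies it stage by stage in a finite-horizon MDP, with the recursion running over the remaining stages.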

2014 ◽ Vol 51 ◽ pp. 165-205 ◽ Author(s): Z. Feldman, C. Domshlak

We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. Formally, the performance of algorithms for online planning is assessed in terms of simple regret, the agent's expected performance loss when the chosen action, rather than an optimal one, is followed. To date, state-of-the-art algorithms for online planning in general MDPs are either best effort, or guarantee only polynomial-rate reduction of simple regret over time. Here we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate and smooth reduction of simple regret. At a high level, BRUE is based on a simple yet non-standard state-space sampling scheme, MCTS2e, in which different parts of each sample are dedicated to different exploratory objectives. We further extend BRUE with a variant of "learning by forgetting." The resulting parametrized algorithm, BRUE(α), exhibits even more attractive formal guarantees than BRUE. Our empirical evaluation shows that both BRUE and its generalization, BRUE(α), are also very effective in practice and compare favorably to the state-of-the-art.
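
To make the simple-regret objective concrete: since only the finally recommended action is scored, a planner is free to explore without regard for the rewards it accrues along the way and to commit only when interrupted. The sketch below is a hypothetical single-state (bandit) illustration of that separation of exploration from recommendation; it is not the authors' BRUE or MCTS2e code, and all names are invented. Under uniform sampling, the probability of recommending a suboptimal action decays exponentially in the budget (by Hoeffding's inequality), which is the flavor of guarantee the abstract refers to.

    import random

    def explore_then_recommend(simulate_action, num_actions, budget=1000):
        """Round-robin (uniform) exploration, then recommend the empirical best action."""
        counts = [0] * num_actions
        means = [0.0] * num_actions
        for t in range(budget):
            a = t % num_actions                      # uniform exploration: accrued reward is ignored
            r = simulate_action(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]   # incremental mean update
        return max(range(num_actions), key=means.__getitem__)

    # Toy usage: three actions with Bernoulli rewards 0.3, 0.5, 0.7; expect action 2.
    random.seed(1)
    print(explore_then_recommend(lambda a: float(random.random() < (0.3, 0.5, 0.7)[a]), num_actions=3))

A UCB-style rule would instead concentrate samples on the apparently best action to keep cumulative regret low, which is precisely what slows the reduction of simple regret.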


1992 ◽ Vol 29 (03) ◽ pp. 633-644 ◽ Author(s): K. D. Glazebrook, Michael P. Bailey, Lyn R. Whitaker

In response to the computational complexity of the dynamic programming/backwards induction approach to the development of optimal policies for semi-Markov decision processes, we propose a class of heuristics resulting from an inductive process which proceeds forwards in time. These heuristics always choose actions in such a way as to minimize some measure of the current cost rate. We describe a procedure for calculating such cost rate heuristics. The quality of the performance of such policies is related to the speed of evolution (in a cost sense) of the process. A simple model of preventive maintenance is described in detail. Cost rate heuristics for this problem are calculated and assessed computationally.
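
As a concrete illustration of the forwards-in-time idea, the toy sketch below applies a cost rate heuristic to an invented preventive maintenance model: in each wear state, compare the immediate cost per unit time of continuing operation against that of replacing, and greedily take the cheaper rate. The states, costs, and model structure are assumptions for the example, not the model analyzed in the paper.

    # Operating cost per unit time at each wear level (invented numbers).
    RUN_COST = {0: 1.0, 1: 2.5, 2: 6.0}
    REPLACE_COST = 10.0   # fixed cost of a replacement
    REPLACE_TIME = 2.0    # duration of a replacement

    def cost_rate(state, action):
        """Immediate cost per unit time of taking `action` in wear state `state`."""
        if action == "replace":
            return REPLACE_COST / REPLACE_TIME   # 5.0 per unit time while replacing
        return RUN_COST[state]                   # running cost accrues at this rate

    def heuristic_action(state):
        """Greedy cost rate heuristic: pick the action with the smaller current cost rate."""
        return min(("continue", "replace"), key=lambda a: cost_rate(state, a))

    for s in sorted(RUN_COST):
        print(s, heuristic_action(s))   # replaces once the run rate exceeds 5.0

No backwards induction is needed: the rule looks only at current rates, which is why it scales where full dynamic programming does not, at the price of optimality when slowly accruing future costs matter.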

