Simple Regret Optimization in Online Planning for Markov Decision Processes

We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. Formally, the performance of algorithms for online planning is assessed in terms of simple regret, the agent's expected performance loss when the chosen action, rather than an optimal one, is followed. To date, state-of-the-art algorithms for online planning in general MDPs are either best effort, or guarantee only polynomial-rate reduction of simple regret over time. Here we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate and smooth reduction of simple regret. At a high level, BRUE is based on a simple yet non-standard state-space sampling scheme, MCTS2e, in which different parts of each sample are dedicated to different exploratory objectives. We further extend BRUE with a variant of ``learning by forgetting.'' The resulting parametrized algorithm, BRUE(alpha), exhibits even more attractive formal guarantees than BRUE. Our empirical evaluation shows that both BRUE and its generalization, BRUE(alpha), are also very effective in practice and compare favorably to the state-of-the-art.

Download Full-text

An empirical evaluation of interval estimation for Markov decision processes

16th IEEE International Conference on Tools with Artificial Intelligence TAI-04 ◽

10.1109/ictai.2004.28 ◽

2005 ◽

Cited By ~ 8

Author(s):

A.L. Strehl ◽

M.L. Littman

Keyword(s):

Markov Decision Processes ◽

Empirical Evaluation ◽

Interval Estimation ◽

Decision Processes ◽

Markov Decision

Download Full-text

Kalman Temporal Differences

Journal of Artificial Intelligence Research ◽

10.1613/jair.3077 ◽

2010 ◽

Vol 39 ◽

pp. 483-532 ◽

Cited By ~ 29

Author(s):

M. Geist ◽

O. Pietquin

Keyword(s):

Markov Decision Processes ◽

Function Approximation ◽

Approximation Scheme ◽

State Of The Art ◽

Decision Processes ◽

Temporal Differences ◽

Special Cases ◽

Markov Decision ◽

Biased Estimates ◽

Q Function

Because reinforcement learning suffers from a lack of scalability, online value (and Q-) function approximation has received increasing interest this last decade. This contribution introduces a novel approximation scheme, namely the Kalman Temporal Differences (KTD) framework, that exhibits the following features: sample-efficiency, non-linear approximation, non-stationarity handling and uncertainty management. A first KTD-based algorithm is provided for deterministic Markov Decision Processes (MDP) which produces biased estimates in the case of stochastic transitions. Than the eXtended KTD framework (XKTD), solving stochastic MDP, is described. Convergence is analyzed for special cases for both deterministic and stochastic transitions. Related algorithms are experimented on classical benchmarks. They compare favorably to the state of the art while exhibiting the announced features.

Download Full-text

Enforcing Almost-Sure Reachability in POMDPs

Computer Aided Verification - Lecture Notes in Computer Science ◽

10.1007/978-3-030-81688-9_28 ◽

2021 ◽

pp. 602-625

Author(s):

Sebastian Junges ◽

Nils Jansen ◽

Sanjit A. Seshia

Keyword(s):

Markov Decision Processes ◽

Empirical Evaluation ◽

Decision Processes ◽

Limited Information ◽

Sequential Decision ◽

Goal State ◽

Learning Agent ◽

Markov Decision ◽

System Configurations ◽

Partially Observable

AbstractPartially-Observable Markov Decision Processes (POMDPs) are a well-known stochastic model for sequential decision making under limited information. We consider the EXPTIME-hard problem of synthesising policies that almost-surely reach some goal state without ever visiting a bad state. In particular, we are interested in computing the winning region, that is, the set of system configurations from which a policy exists that satisfies the reachability specification. A direct application of such a winning region is the safe exploration of POMDPs by, for instance, restricting the behavior of a reinforcement learning agent to the region. We present two algorithms: A novel SAT-based iterative approach and a decision-diagram based alternative. The empirical evaluation demonstrates the feasibility and efficacy of the approaches.

Download Full-text

Point-Based Monte Carto Online Planning in POMDPs

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.846-847.1388 ◽

2013 ◽

Vol 846-847 ◽

pp. 1388-1391

Author(s):

Bo Wu ◽

Yan Peng Feng ◽

Hong Yan Zheng

Keyword(s):

Mean Squared Error ◽

Search Algorithm ◽

Search Tree ◽

Real Time System ◽

Monte Carlo Tree Search ◽

Online Planning ◽

Markov Decision ◽

Partially Observable ◽

Belief States ◽

Tree Search Algorithm

The online planning and learning in partially observable Markov decision processes are often intractable because belief states space has two curses: dimensionality and history. In order to address this problem, this paper proposes a point-based Monte Carto online planning approach in POMDPs. This approach involves performing value backup at specific reachable belief points, rather than over the entire belief simplex, to speed up computation processes. Then Monte Carlo tree search algorithm is exploited to share the value of actions across each subtree of the search tree so as to minimise the mean squared error. The experimental results show that the proposed algorithm is effective in real-time system.

Download Full-text

Decentralized control of multi-robot partially observable Markov decision processes using belief space macro-actions

The International Journal of Robotics Research ◽

10.1177/0278364917692864 ◽

2017 ◽

Vol 36 (2) ◽

pp. 231-258 ◽

Cited By ~ 14

Author(s):

Shayegan Omidshafiei ◽

Ali–Akbar Agha–Mohammadi ◽

Christopher Amato ◽

Shih–Yuan Liu ◽

Jonathan P How ◽

...

Keyword(s):

Markov Decision Processes ◽

Large Scale ◽

Decision Processes ◽

Delivery Problem ◽

Package Delivery ◽

Markov Decision ◽

Partially Observable Markov ◽

High Level ◽

Partially Observable ◽

Multi Robot

This work focuses on solving general multi-robot planning problems in continuous spaces with partial observability given a high-level domain description. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are general models for multi-robot coordination problems. However, representing and solving Dec-POMDPs is often intractable for large problems. This work extends the Dec-POMDP model to the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP) to take advantage of the high-level representations that are natural for multi-robot problems and to facilitate scalable solutions to large discrete and continuous problems. The Dec-POSMDP formulation uses task macro-actions created from lower-level local actions that allow for asynchronous decision-making by the robots, which is crucial in multi-robot domains. This transformation from Dec-POMDPs to Dec-POSMDPs with a finite set of automatically-generated macro-actions allows use of efficient discrete-space search algorithms to solve them. The paper presents algorithms for solving Dec-POSMDPs, which are more scalable than previous methods since they can incorporate closed-loop belief space macro-actions in planning. These macro-actions are automatically constructed to produce robust solutions. The proposed algorithms are then evaluated on a complex multi-robot package delivery problem under uncertainty, showing that our approach can naturally represent realistic problems and provide high-quality solutions for large-scale problems.

Download Full-text

An Evolutionary Random Policy Search Algorithm for Solving Markov Decision Processes

INFORMS Journal on Computing ◽

10.1287/ijoc.1050.0155 ◽

2007 ◽

Vol 19 (2) ◽

pp. 161-174 ◽

Cited By ~ 7

Author(s):

Jiaqiao Hu ◽

Michael C. Fu ◽

Vahid R. Ramezani ◽

Steven I. Marcus

Keyword(s):

Markov Decision Processes ◽

Search Algorithm ◽

Decision Processes ◽

Policy Search ◽

Markov Decision

Download Full-text

Online Planning for Large Markov Decision Processes with Hierarchical Decomposition

ACM Transactions on Intelligent Systems and Technology ◽

10.1145/2717316 ◽

2015 ◽

Vol 6 (4) ◽

pp. 1-28 ◽

Cited By ~ 14

Author(s):

Aijun Bai ◽

Feng Wu ◽

Xiaoping Chen

Keyword(s):

Markov Decision Processes ◽

Decision Processes ◽

Hierarchical Decomposition ◽

Online Planning ◽

Markov Decision

Download Full-text

Monte Carlo Tree Search for Verifying Reachability in Markov Decision Processes

Leveraging Applications of Formal Methods, Verification and Validation. Verification - Lecture Notes in Computer Science ◽

10.1007/978-3-030-03421-4_21 ◽

2018 ◽

pp. 322-335 ◽

Cited By ~ 2

Author(s):

Pranav Ashok ◽

Tomáš Brázdil ◽

Jan Křetínský ◽

Ondřej Slámečka

Keyword(s):

Monte Carlo ◽

Markov Decision Processes ◽

Decision Processes ◽

Tree Search ◽

Monte Carlo Tree Search ◽

Markov Decision

Download Full-text

Markov Decision Processes, AlphaGo, and Monte Carlo Tree Search: Back to the Future

The Operations Research Revolution ◽

10.1287/educ.2017.0166 ◽

2017 ◽

pp. 68-88 ◽

Cited By ~ 2

Author(s):

Michael C. Fu

Keyword(s):

Monte Carlo ◽

Markov Decision Processes ◽

Decision Processes ◽

Tree Search ◽

Monte Carlo Tree Search ◽

The Future ◽

Markov Decision

Download Full-text

Too Many cooks: Bayesian inference for coordinating Multi-agent Collaboration

10.1093/oso/9780198862536.003.0008 ◽

2021 ◽

pp. 152-170

Author(s):

Rose E. Wang ◽

Sarah A. Wu ◽

James A. Evans ◽

David C. Parkes ◽

Joshua B. Tenenbaum ◽

...

Keyword(s):

Markov Decision Processes ◽

Ad Hoc ◽

Prior Experience ◽

Decision Processes ◽

Learning Mechanism ◽

Agent Learning ◽

Agent Collaboration ◽

Markov Decision ◽

Multi Agent ◽

High Level

Collaboration requires agents to coordinate their behavior on the fly, sometimes cooperating to solve a single task together and other times dividing it up into sub-tasks to work on in parallel. Here, we develop Bayesian Delegation, a decentralized multi-agent learning mechanism with these abilities. Bayesian Delegation enables agents to rapidly infer the hidden intentions of others by inverse planning. We test Bayesian Delegation in a suite of multi-agent Markov decision processes inspired by cooking problems. On these tasks, agents with Bayesian Delegation coordinate both their high-level plans (e.g. what sub-task they should work on) and their low-level actions (e.g. avoiding getting in each other’s way). In a self-play evaluation, Bayesian Delegation outperforms alternative algorithms. Bayesian Delegation is also a capable ad-hoc collaborator and successfully coordinates with other agent types even in the absence of prior experience.

Download Full-text