Decision-Theoretic Planning: Structural Assumptions and Computational Leverage

1999 ◽  
Vol 11 ◽  
pp. 1-94 ◽  
Author(s):  
C. Boutilier ◽  
T. Dean ◽  
S. Hanks

Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to describe performance criteria, in the functions used to describe state transitions and observations, and in the relationships among features used to describe states, actions, rewards, and observations. Specialized representations, and algorithms employing these representations, can achieve computational leverage by exploiting these various forms of structure. Certain AI techniques -- in particular those based on the use of structured, intensional representations -- can be viewed in this way. This paper surveys several types of representations for both classical and decision-theoretic planning problems, and planning algorithms that exploit these representations in a number of different ways to ease the computational burden of constructing policies or plans. It focuses primarily on abstraction, aggregation and decomposition techniques based on AI-style representations.
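The MDP machinery underlying this synthesis can be made concrete in a few lines of code. The value-iteration sketch below computes an optimal policy for a tiny finite MDP; the states, actions, transition table, and rewards are hypothetical and only illustrate the standard construction, not any example from the paper.

```python
# Minimal value-iteration sketch for a finite MDP (illustrative only).
# States, actions, transition probabilities, and rewards are hypothetical.

gamma = 0.95          # discount factor
states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]

# transitions[s][a] -> list of (next_state, probability)
transitions = {
    "s0": {"a0": [("s0", 0.7), ("s1", 0.3)], "a1": [("s2", 1.0)]},
    "s1": {"a0": [("s1", 1.0)],              "a1": [("s0", 0.4), ("s2", 0.6)]},
    "s2": {"a0": [("s2", 1.0)],              "a1": [("s2", 1.0)]},
}
# rewards[s][a] -> immediate expected reward
rewards = {
    "s0": {"a0": 0.0, "a1": 1.0},
    "s1": {"a0": 0.5, "a1": 0.0},
    "s2": {"a0": 0.0, "a1": 0.0},
}

V = {s: 0.0 for s in states}
for _ in range(1000):
    V_new = {}
    for s in states:
        V_new[s] = max(
            rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])
            for a in actions
        )
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:
        V = V_new
        break
    V = V_new

# Greedy policy extracted from the converged value function.
policy = {
    s: max(actions, key=lambda a: rewards[s][a]
           + gamma * sum(p * V[s2] for s2, p in transitions[s][a]))
    for s in states
}
print(V, policy)
```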

2019 ◽  
Vol 19 (5-6) ◽  
pp. 1090-1106
Author(s):  
YI WANG ◽  
SHIQI ZHANG ◽  
JOOHYUNG LEE

To be responsive to dynamically changing real-world environments, an intelligent agent needs to perform complex sequential decision-making tasks that are often guided by commonsense knowledge. The previous work on this line of research led to the framework called interleaved commonsense reasoning and probabilistic planning (icorpp), which used P-log for representing commonsense knowledge and Markov Decision Processes (MDPs) or Partially Observable MDPs (POMDPs) for planning under uncertainty. A main limitation of icorpp is that its implementation requires non-trivial engineering efforts to bridge the commonsense reasoning and probabilistic planning formalisms. In this paper, we present a unified framework to integrate icorpp’s reasoning and planning components. In particular, we extend the probabilistic action language pBC+ to express utility, belief states, and observations as in POMDP models. Inheriting the advantages of action languages, the new action language provides an elaboration-tolerant representation of POMDPs that reflects commonsense knowledge. The idea led to the design of the system pbcplus2pomdp, which compiles a pBC+ action description into a POMDP model that can be directly processed by off-the-shelf POMDP solvers to compute an optimal policy of the pBC+ action description. Our experiments show that it retains the advantages of icorpp while avoiding the manual efforts in bridging the commonsense reasoner and the probabilistic planner.
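As a rough illustration of what a compiler such as pbcplus2pomdp has to produce, the sketch below writes out a complete POMDP specification (states, actions, observations, transition and observation probabilities, rewards, discount) as a plain data structure; the toy domain and all numbers are hypothetical and are not taken from the paper or its system.

```python
# Illustrative sketch of the kind of POMDP specification a compiler such as
# pbcplus2pomdp would need to emit; the domain and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list
    actions: list
    observations: list
    transition: dict      # transition[a][s][s'] = P(s' | s, a)
    observe: dict         # observe[a][s'][o]    = P(o | s', a)
    reward: dict          # reward[s][a]         = immediate utility
    discount: float = 0.95

# A toy "door open / closed" domain with a noisy sensor.
pomdp = POMDP(
    states=["open", "closed"],
    actions=["push", "sense"],
    observations=["obs_open", "obs_closed"],
    transition={
        "push":  {"open": {"open": 1.0}, "closed": {"open": 0.8, "closed": 0.2}},
        "sense": {"open": {"open": 1.0}, "closed": {"closed": 1.0}},
    },
    observe={
        "push":  {"open": {"obs_open": 0.5, "obs_closed": 0.5},
                  "closed": {"obs_open": 0.5, "obs_closed": 0.5}},
        "sense": {"open": {"obs_open": 0.9, "obs_closed": 0.1},
                  "closed": {"obs_open": 0.1, "obs_closed": 0.9}},
    },
    reward={"open":   {"push": 0.0, "sense": -1.0},
            "closed": {"push": 5.0, "sense": -1.0}},
)
```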


Author(s):  
Ming-Sheng Ying ◽  
Yuan Feng ◽  
Sheng-Gang Ying

The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of MDP, namely the quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
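The finite-horizon dynamic programming referred to has a familiar classical skeleton. The backward-induction sketch below is for an ordinary finite MDP; in the qMDP setting the stochastic transition kernel would be replaced by quantum operations acting on quantum states. All names and numbers are illustrative.

```python
# Backward induction over a finite horizon for an ordinary MDP; in the qMDP
# setting the stochastic transition kernel is replaced by quantum operations.
# All data here are illustrative.

H = 5                                    # horizon length
states, actions = ["s0", "s1"], ["a0", "a1"]
P = {  # P[a][s][s'] = transition probability
    "a0": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s1": 1.0}},
    "a1": {"s0": {"s1": 1.0},            "s1": {"s0": 0.5, "s1": 0.5}},
}
R = {"s0": {"a0": 0.0, "a1": 1.0}, "s1": {"a0": 2.0, "a1": 0.0}}

V = {s: 0.0 for s in states}             # terminal values at stage H
policy = []                              # policy[t][s] = optimal action at stage t
for t in reversed(range(H)):
    Q = {s: {a: R[s][a] + sum(p * V[s2] for s2, p in P[a][s].items())
             for a in actions} for s in states}
    policy.insert(0, {s: max(actions, key=Q[s].get) for s in states})
    V = {s: max(Q[s].values()) for s in states}

print("stage-0 values:", V)
print("stage-0 policy:", policy[0])
```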


2018 ◽  
Vol 15 (02) ◽  
pp. 1850011 ◽  
Author(s):  
Frano Petric ◽  
Damjan Miklić ◽  
Zdenko Kovačić

The existing procedures for autism spectrum disorder (ASD) diagnosis are often time-consuming and tiresome both for highly trained human evaluators and children, which may be alleviated by using humanoid robots in the diagnostic process. Hence, this paper proposes a framework for robot-assisted ASD evaluation based on partially observable Markov decision process (POMDP) modeling, specifically POMDPs with mixed observability (MOMDPs). POMDPs are broadly used for modeling optimal sequential decision-making tasks under uncertainty. Spurred by the widely accepted Autism Diagnostic Observation Schedule (ADOS), we emulate ADOS through four tasks, whose models incorporate observations of multiple social cues such as eye contact, gestures, and utterances. Relying only on those observations, the robot provides an assessment of the child’s ASD-relevant functioning level (which is partially observable) within a particular task and provides human evaluators with readable information by partitioning its belief space. Finally, we evaluate the proposed MOMDP task models and demonstrate that chaining the tasks provides fine-grained outcome quantification, which could also increase the appeal of robot-assisted diagnostic protocols in the future.
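For intuition on the belief tracking involved, the sketch below performs a single-cue Bayesian belief update over a hidden functioning level, the core step any POMDP/MOMDP-based assessment relies on; the levels, the cue model, and the probabilities are purely hypothetical and are not drawn from the ADOS-based task models.

```python
# Bayesian belief update over a hidden "functioning level", as one step of the
# kind of belief tracking a POMDP/MOMDP-based assessment would perform.
# The levels, cue model, and probabilities below are purely illustrative.

levels = ["low", "medium", "high"]              # hidden ASD-relevant functioning level
belief = {"low": 1/3, "medium": 1/3, "high": 1/3}

# P(eye contact observed | level) for a single social cue (hypothetical numbers).
cue_likelihood = {"low": 0.2, "medium": 0.5, "high": 0.8}

def update(belief, observed: bool):
    """One Bayes-filter step; transitions omitted (level assumed static within a task)."""
    post = {}
    for lv, b in belief.items():
        like = cue_likelihood[lv] if observed else 1.0 - cue_likelihood[lv]
        post[lv] = b * like
    z = sum(post.values())
    return {lv: p / z for lv, p in post.items()}

belief = update(belief, observed=True)           # the robot saw eye contact
belief = update(belief, observed=False)          # no eye contact on the next prompt
print(belief)
```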


2021 ◽  
pp. 1-16
Author(s):  
Pegah Alizadeh ◽  
Emiliano Traversi ◽  
Aomar Osmani

Markov decision process models (MDPs) are a powerful tool for planning tasks and sequential decision-making problems. In this work we deal with MDPs with imprecise rewards, often used in situations where the data are uncertain. In this context, we provide algorithms for finding the policy that minimizes the maximum regret. To the best of our knowledge, all the regret-based methods proposed in the literature focus on providing an optimal stochastic policy. We introduce for the first time a method to calculate an optimal deterministic policy using optimization approaches. Deterministic policies are easily interpretable for users because for a given state they provide a unique choice. To better motivate the use of an exact procedure for finding a deterministic policy, we show some (theoretical and experimental) cases where the intuitive idea of using a deterministic policy obtained after “determinizing” the optimal stochastic policy leads to a policy far from the exact deterministic policy.
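To make the minimax-regret objective concrete: the regret of a fixed policy under a candidate reward function is the gap between its value and the best value achievable under that reward, and the goal is a policy whose worst-case gap is smallest. The brute-force sketch below enumerates the deterministic policies of a tiny imprecise-reward MDP and picks the one with minimum maximum regret; it only illustrates the objective, not the optimization procedure of the paper, and all data are hypothetical.

```python
# Enumerative max-regret computation for a tiny imprecise-reward MDP.
# The paper uses optimization methods; this brute-force sketch only
# illustrates the minimax-regret objective.  All data are hypothetical.
import itertools

gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]
P = {"a0": {"s0": {"s0": 1.0}, "s1": {"s0": 1.0}},
     "a1": {"s0": {"s1": 1.0}, "s1": {"s1": 1.0}}}

# Finite set of candidate reward functions capturing the imprecision.
candidate_rewards = [
    {"s0": {"a0": 1.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 2.0}},
    {"s0": {"a0": 0.0, "a1": 1.5}, "s1": {"a0": 2.0, "a1": 0.0}},
]

def q_values(V, R, s):
    return {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[a][s].items())
            for a in actions}

def optimal_value(R, iters=500):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(q_values(V, R, s).values()) for s in states}
    return V

def policy_value(pi, R, iters=500):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: q_values(V, R, s)[pi[s]] for s in states}
    return V

def max_regret(pi):
    return max(max(optimal_value(R)[s] - policy_value(pi, R)[s] for s in states)
               for R in candidate_rewards)

deterministic_policies = [dict(zip(states, choice))
                          for choice in itertools.product(actions, repeat=len(states))]
best = min(deterministic_policies, key=max_regret)
print(best, max_regret(best))
```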


2012 ◽  
Vol 580 ◽  
pp. 175-179 ◽  
Author(s):  
Hong Fu Liu ◽  
Yu Zhang ◽  
Shao Fei Chen ◽  
Jing Chen

We propose a framework based on stochastic collocation to solve autonomous vehicle optimal trajectory planning problems with probabilistic uncertainty. We model uncertainty in the location and size of obstacles. We develop stochastic pseudospectral methods to minimize the expected cost of the governing differential equations while meeting path, control, and boundary constraints. Results are shown on two examples of autonomous vehicle trajectory planning under uncertainty, which illustrate the feasibility and applicability of our method.
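The stochastic-collocation idea can be sketched in a few lines: evaluate a deterministic cost at a small set of quadrature nodes for the uncertain parameter and combine the results with the quadrature weights to approximate the expected cost. The example below does this for a trajectory threatened by an obstacle with a Gaussian-uncertain position; the trajectory, obstacle model, and cost function are hypothetical and only illustrate the mechanism, not the pseudospectral formulation of the paper.

```python
# Stochastic-collocation sketch: approximate the expected cost of a candidate
# trajectory under a Gaussian-uncertain obstacle position using Gauss-Hermite
# quadrature nodes.  The trajectory, obstacle model, and cost are hypothetical.
import numpy as np

# Candidate trajectory: straight line sampled at a few waypoints.
waypoints = np.linspace([0.0, 0.0], [10.0, 0.0], 50)

# Obstacle x-position is uncertain: x_obs ~ N(mu, sigma^2); y fixed at 0.5.
mu, sigma = 5.0, 0.4
nodes, weights = np.polynomial.hermite_e.hermegauss(7)   # probabilists' Hermite
x_obs_nodes = mu + sigma * nodes
weights = weights / weights.sum()                         # normalize to sum to 1

def cost(x_obs):
    """Path length plus a penalty for passing close to the obstacle."""
    obstacle = np.array([x_obs, 0.5])
    d = np.linalg.norm(waypoints - obstacle, axis=1).min()
    return 10.0 + 100.0 * max(0.0, 1.0 - d)               # length + clearance penalty

expected_cost = sum(w * cost(x) for w, x in zip(weights, x_obs_nodes))
print("expected cost ~", expected_cost)
```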


2018 ◽  
Vol 37 (13-14) ◽  
pp. 1632-1672 ◽  
Author(s):  
Sanjiban Choudhury ◽  
Mohak Bhardwaj ◽  
Sankalp Arora ◽  
Ashish Kapoor ◽  
Gireeja Ranade ◽  
...  

Robot planning is the process of selecting a sequence of actions that optimize for a task-specific objective. For instance, the objective for a navigation task would be to find collision-free paths, whereas the objective for an exploration task would be to map unknown areas. The optimal solutions to such tasks are heavily influenced by the implicit structure in the environment, i.e. the configuration of objects in the world. State-of-the-art planning approaches, however, do not exploit this structure, thereby expending valuable effort searching the action space instead of focusing on potentially good actions. In this paper, we address the problem of enabling planners to adapt their search strategies by inferring such good actions in an efficient manner using only the information uncovered by the search up until that time. We formulate this as a problem of sequential decision making under uncertainty where at a given iteration a planning policy must map the state of the search to a planning action. Unfortunately, the training process for such partial-information-based policies is slow to converge and susceptible to poor local minima. Our key insight is that if we could fully observe the underlying world map, we would easily be able to disambiguate between good and bad actions. We hence present a novel data-driven imitation learning framework to efficiently train planning policies by imitating a clairvoyant oracle: an oracle that at train time has full knowledge about the world map and can compute optimal decisions. We leverage the fact that for planning problems, such oracles can be efficiently computed and derive performance guarantees for the learnt policy. We examine two important domains that rely on partial-information-based policies: informative path planning and search-based motion planning. We validate the approach on a spectrum of environments for both problem domains, including experiments on a real UAV, and show that the learnt policy consistently outperforms state-of-the-art algorithms. Our framework is able to train policies that achieve up to [Formula: see text] more reward than state-of-the-art information-gathering heuristics and a [Formula: see text] speedup as compared with A* on search-based planning problems. Our approach paves the way forward for applying data-driven techniques to other such problem domains under the umbrella of robot planning.
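The training loop can be pictured schematically: roll out a mixture of the current partial-information policy and the oracle, have the clairvoyant oracle (which sees the full world map) label every visited search state with its preferred action, and fit the policy to the aggregated labels, in the spirit of DAgger-style imitation. The sketch below uses hypothetical interfaces (sample_world, oracle.best_action, policy.fit, and the search-state methods) and is not the paper's implementation.

```python
# Schematic, DAgger-style training loop for imitating a clairvoyant oracle.
# The environments, the oracle, and the policy/search interfaces below are
# hypothetical placeholders, not the paper's implementation.
import random

def train_planning_policy(policy, oracle, sample_world, n_iters=10, episodes=20):
    dataset = []                                   # (search_state_features, oracle_action)
    for _ in range(n_iters):
        for _ in range(episodes):
            world = sample_world()                 # full map, hidden from the policy
            search = world.new_search()            # partially informed search state
            while not search.done():
                feats = search.features()          # info uncovered by the search so far
                # The oracle sees the whole world, so it can label the best next action.
                dataset.append((feats, oracle.best_action(world, search)))
                # Mix oracle and learner actions to cover the learner's state distribution.
                act = oracle.best_action(world, search) if random.random() < 0.5 \
                      else policy.predict(feats)
                search.expand(act)
        policy.fit(dataset)                        # supervised learning on aggregated data
    return policy
```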


2018 ◽  
Vol 24 (3) ◽  
pp. 1043-1058
Author(s):  
Nikolai Dokuchaev

The paper studies the problem of optimal portfolio selection. It is shown that, under some mild conditions, near-optimal strategies for investors with different performance criteria can be constructed using a limited number of fixed processes (mutual funds), for a market with a larger number of available risky stocks. This implies dimension reduction for the optimal portfolio selection problem: all rational investors may achieve optimality using the same mutual funds plus a savings account. This result is obtained under mild restrictions on the utility functions without any assumptions on the regularity of the value function. The proof is based on the method of dynamic programming applied indirectly to some convenient approximations of the original problem that ensure certain regularity of the value functions. To overcome technical difficulties, we use special time-dependent and random constraints for admissible strategies such that the corresponding HJB (Hamilton–Jacobi–Bellman) equation admits “almost explicit” solutions generating near-optimal admissible strategies featuring sufficient regularity and integrability.
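For orientation, the HJB equation mentioned here is of the standard stochastic-control form; the display below is the generic statement for a controlled diffusion with terminal utility and is not the paper's specific constrained formulation.

```latex
% Generic HJB equation for the value function V(t,x) of a controlled diffusion
% dX_t = b(X_t,u_t)\,dt + \sigma(X_t,u_t)\,dW_t with terminal utility \Phi;
% not the paper's specific constrained formulation.
\[
  \partial_t V(t,x)
  + \sup_{u \in U} \Big\{ b(x,u)\,\partial_x V(t,x)
  + \tfrac{1}{2}\,\sigma^2(x,u)\,\partial_{xx} V(t,x) \Big\} = 0,
  \qquad V(T,x) = \Phi(x).
\]
```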


Author(s):  
Carlos Diuk ◽  
Michael Littman

Reinforcement learning (RL) deals with the problem of an agent that has to learn how to behave to maximize its utility through its interactions with an environment (Sutton & Barto, 1998; Kaelbling, Littman & Moore, 1996). Reinforcement learning problems are usually formalized as Markov Decision Processes (MDPs), which consist of a finite set of states and a finite number of possible actions that the agent can perform. At any given point in time, the agent is in a certain state and picks an action. It can then observe the new state this action leads to, and receives a reward signal. The goal of the agent is to maximize its long-term reward. In this standard formalization, no particular structure or relationship between states is assumed. However, learning in environments with extremely large state spaces is infeasible without some form of generalization. Exploiting the underlying structure of a problem can enable generalization and has long been recognized as an important aspect of representing sequential decision tasks (Boutilier et al., 1999). Hierarchical reinforcement learning is the subfield of RL that deals with the discovery and/or exploitation of this underlying structure. Two main ideas come into play in hierarchical RL. The first is to break a task into a hierarchy of smaller subtasks, each of which can be learned faster and more easily than the whole problem. Subtasks can also be performed multiple times in the course of achieving the larger task, reusing accumulated knowledge and skills. The second idea is to use state abstraction within subtasks: not every subtask needs to be concerned with every aspect of the state space, so some states can be abstracted away and treated as the same for the purpose of the given subtask.
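A toy illustration of the two ideas, in the spirit of this description rather than any specific hierarchical-RL algorithm: a keyed-door gridworld broken into two subtasks, each with its own termination condition and its own state abstraction. The domain and all names are hypothetical.

```python
# Toy illustration of the two hierarchical-RL ideas above: subtasks that can be
# learned and reused, and per-subtask state abstraction.  The keyed-door
# gridworld and all names are hypothetical.

# Full state: (x, y, has_key, door_open).  Each subtask sees only what it needs.
def abstract_for_fetch_key(state):
    x, y, has_key, door_open = state
    return (x, y)                       # key possession is what this subtask achieves

def abstract_for_open_door(state):
    x, y, has_key, door_open = state
    return (x, y, has_key)              # this subtask must know whether the key is held

class Subtask:
    def __init__(self, name, abstraction, is_done):
        self.name = name
        self.abstraction = abstraction  # state-abstraction function
        self.is_done = is_done          # termination condition on the full state
        self.q = {}                     # Q-table over *abstract* states

    def choose(self, state, actions):
        s = self.abstraction(state)
        return max(actions, key=lambda a: self.q.get((s, a), 0.0))

# The overall task is a fixed sequence of subtasks; a root policy could also
# learn which subtask to invoke.
fetch_key = Subtask("fetch_key", abstract_for_fetch_key, lambda s: s[2])
open_door = Subtask("open_door", abstract_for_open_door, lambda s: s[3])
hierarchy = [fetch_key, open_door]
```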


2019 ◽  
Vol 11 (1) ◽  
pp. 833-858 ◽  
Author(s):  
John Rust

Dynamic programming (DP) is a powerful tool for solving a wide class of sequential decision-making problems under uncertainty. In principle, it enables us to compute optimal decision rules that specify the best possible decision in any situation. This article reviews developments in DP and contrasts its revolutionary impact on economics, operations research, engineering, and artificial intelligence with the comparative paucity of its real-world applications to improve the decision making of individuals and firms. The fuzziness of many real-world decision problems and the difficulty in mathematically modeling them are key obstacles to a wider application of DP in real-world settings. Nevertheless, I discuss several success stories, and I conclude that DP offers substantial promise for improving decision making if we let go of the empirically untenable assumption of unbounded rationality and confront the challenging decision problems faced every day by individuals and firms.

