Decision-Theoretic Planning: Structural Assumptions and Computational Leverage

1999 ◽  
Vol 11 ◽  
pp. 1-94 ◽  
Author(s):  
C. Boutilier ◽  
T. Dean ◽  
S. Hanks

Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to describe performance criteria, in the functions used to describe state transitions and observations, and in the relationships among features used to describe states, actions, rewards, and observations. Specialized representations, and algorithms employing these representations, can achieve computational leverage by exploiting these various forms of structure. Certain AI techniques -- in particular those based on the use of structured, intensional representations -- can be viewed in this way. This paper surveys several types of representations for both classical and decision-theoretic planning problems, and planning algorithms that exploit these representations in a number of different ways to ease the computational burden of constructing policies or plans. It focuses primarily on abstraction, aggregation and decomposition techniques based on AI-style representations.
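The MDP machinery underlying this synthesis can be made concrete in a few lines of code. The value-iteration sketch below computes an optimal policy for a tiny finite MDP; the states, actions, transition table, and rewards are hypothetical and only illustrate the standard construction, not any example from the paper.

```python
# Minimal value-iteration sketch for a finite MDP (illustrative only).
# States, actions, transition probabilities, and rewards are hypothetical.

gamma = 0.95          # discount factor
states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]

# transitions[s][a] -> list of (next_state, probability)
transitions = {
    "s0": {"a0": [("s0", 0.7), ("s1", 0.3)], "a1": [("s2", 1.0)]},
    "s1": {"a0": [("s1", 1.0)],              "a1": [("s0", 0.4), ("s2", 0.6)]},
    "s2": {"a0": [("s2", 1.0)],              "a1": [("s2", 1.0)]},
}
# rewards[s][a] -> immediate expected reward
rewards = {
    "s0": {"a0": 0.0, "a1": 1.0},
    "s1": {"a0": 0.5, "a1": 0.0},
    "s2": {"a0": 0.0, "a1": 0.0},
}

V = {s: 0.0 for s in states}
for _ in range(1000):
    V_new = {}
    for s in states:
        V_new[s] = max(
            rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])
            for a in actions
        )
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:
        V = V_new
        break
    V = V_new

# Greedy policy extracted from the converged value function.
policy = {
    s: max(actions, key=lambda a: rewards[s][a]
           + gamma * sum(p * V[s2] for s2, p in transitions[s][a]))
    for s in states
}
print(V, policy)
```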

2019 ◽  
Vol 19 (5-6) ◽  
pp. 1090-1106
Author(s):  
YI WANG ◽  
SHIQI ZHANG ◽  
JOOHYUNG LEE

To be responsive to dynamically changing real-world environments, an intelligent agent needs to perform complex sequential decision-making tasks that are often guided by commonsense knowledge. The previous work on this line of research led to the framework called interleaved commonsense reasoning and probabilistic planning (icorpp), which used P-log for representing commonsense knowledge and Markov Decision Processes (MDPs) or Partially Observable MDPs (POMDPs) for planning under uncertainty. A main limitation of icorpp is that its implementation requires non-trivial engineering efforts to bridge the commonsense reasoning and probabilistic planning formalisms. In this paper, we present a unified framework to integrate icorpp’s reasoning and planning components. In particular, we extend the probabilistic action language pBC+ to express utility, belief states, and observations as in POMDP models. Inheriting the advantages of action languages, the new action language provides an elaboration-tolerant representation of POMDPs that reflects commonsense knowledge. The idea led to the design of the system pbcplus2pomdp, which compiles a pBC+ action description into a POMDP model that can be directly processed by off-the-shelf POMDP solvers to compute an optimal policy of the pBC+ action description. Our experiments show that it retains the advantages of icorpp while avoiding the manual efforts in bridging the commonsense reasoner and the probabilistic planner.
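As a rough illustration of what a compiler such as pbcplus2pomdp has to produce, the sketch below writes out a complete POMDP specification (states, actions, observations, transition and observation probabilities, rewards, discount) as a plain data structure; the toy domain and all numbers are hypothetical and are not taken from the paper or its system.

```python
# Illustrative sketch of the kind of POMDP specification a compiler such as
# pbcplus2pomdp would need to emit; the domain and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list
    actions: list
    observations: list
    transition: dict      # transition[a][s][s'] = P(s' | s, a)
    observe: dict         # observe[a][s'][o]    = P(o | s', a)
    reward: dict          # reward[s][a]         = immediate utility
    discount: float = 0.95

# A toy "door open / closed" domain with a noisy sensor.
pomdp = POMDP(
    states=["open", "closed"],
    actions=["push", "sense"],
    observations=["obs_open", "obs_closed"],
    transition={
        "push":  {"open": {"open": 1.0}, "closed": {"open": 0.8, "closed": 0.2}},
        "sense": {"open": {"open": 1.0}, "closed": {"closed": 1.0}},
    },
    observe={
        "push":  {"open": {"obs_open": 0.5, "obs_closed": 0.5},
                  "closed": {"obs_open": 0.5, "obs_closed": 0.5}},
        "sense": {"open": {"obs_open": 0.9, "obs_closed": 0.1},
                  "closed": {"obs_open": 0.1, "obs_closed": 0.9}},
    },
    reward={"open":   {"push": 0.0, "sense": -1.0},
            "closed": {"push": 5.0, "sense": -1.0}},
)
```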


Author(s):  
Ming-Sheng Ying ◽  
Yuan Feng ◽  
Sheng-Gang Ying

The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of MDP, namely the quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
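The finite-horizon dynamic programming referred to has a familiar classical skeleton. The backward-induction sketch below is for an ordinary finite MDP; in the qMDP setting the stochastic transition kernel would be replaced by quantum operations acting on quantum states. All names and numbers are illustrative.

```python
# Backward induction over a finite horizon for an ordinary MDP; in the qMDP
# setting the stochastic transition kernel is replaced by quantum operations.
# All data here are illustrative.

H = 5                                    # horizon length
states, actions = ["s0", "s1"], ["a0", "a1"]
P = {  # P[a][s][s'] = transition probability
    "a0": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s1": 1.0}},
    "a1": {"s0": {"s1": 1.0},            "s1": {"s0": 0.5, "s1": 0.5}},
}
R = {"s0": {"a0": 0.0, "a1": 1.0}, "s1": {"a0": 2.0, "a1": 0.0}}

V = {s: 0.0 for s in states}             # terminal values at stage H
policy = []                              # policy[t][s] = optimal action at stage t
for t in reversed(range(H)):
    Q = {s: {a: R[s][a] + sum(p * V[s2] for s2, p in P[a][s].items())
             for a in actions} for s in states}
    policy.insert(0, {s: max(actions, key=Q[s].get) for s in states})
    V = {s: max(Q[s].values()) for s in states}

print("stage-0 values:", V)
print("stage-0 policy:", policy[0])
```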


2018 ◽  
Vol 15 (02) ◽  
pp. 1850011 ◽  
Author(s):  
Frano Petric ◽  
Damjan Miklić ◽  
Zdenko Kovačić

The existing procedures for autism spectrum disorder (ASD) diagnosis are often time-consuming and tiresome both for highly trained human evaluators and children, which may be alleviated by using humanoid robots in the diagnostic process. Hence, this paper proposes a framework for robot-assisted ASD evaluation based on partially observable Markov decision process (POMDP) modeling, specifically POMDPs with mixed observability (MOMDPs). POMDPs are broadly used for modeling optimal sequential decision-making tasks under uncertainty. Spurred by the widely accepted Autism Diagnostic Observation Schedule (ADOS), we emulate ADOS through four tasks, whose models incorporate observations of multiple social cues such as eye contact, gestures, and utterances. Relying only on those observations, the robot provides an assessment of the child’s ASD-relevant functioning level (which is partially observable) within a particular task and provides human evaluators with readable information by partitioning its belief space. Finally, we evaluate the proposed MOMDP task models and demonstrate that chaining the tasks provides fine-grained outcome quantification, which could also increase the appeal of robot-assisted diagnostic protocols in the future.
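For intuition on the belief tracking involved, the sketch below performs a single-cue Bayesian belief update over a hidden functioning level, the core step any POMDP/MOMDP-based assessment relies on; the levels, the cue model, and the probabilities are purely hypothetical and are not drawn from the ADOS-based task models.

```python
# Bayesian belief update over a hidden "functioning level", as one step of the
# kind of belief tracking a POMDP/MOMDP-based assessment would perform.
# The levels, cue model, and probabilities below are purely illustrative.

levels = ["low", "medium", "high"]              # hidden ASD-relevant functioning level
belief = {"low": 1/3, "medium": 1/3, "high": 1/3}

# P(eye contact observed | level) for a single social cue (hypothetical numbers).
cue_likelihood = {"low": 0.2, "medium": 0.5, "high": 0.8}

def update(belief, observed: bool):
    """One Bayes-filter step; transitions omitted (level assumed static within a task)."""
    post = {}
    for lv, b in belief.items():
        like = cue_likelihood[lv] if observed else 1.0 - cue_likelihood[lv]
        post[lv] = b * like
    z = sum(post.values())
    return {lv: p / z for lv, p in post.items()}

belief = update(belief, observed=True)           # the robot saw eye contact
belief = update(belief, observed=False)          # no eye contact on the next prompt
print(belief)
```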


2021 ◽  
pp. 1-16
Author(s):  
Pegah Alizadeh ◽  
Emiliano Traversi ◽  
Aomar Osmani

Markov decision process models (MDPs) are a powerful tool for planning tasks and sequential decision-making problems. In this work we deal with MDPs with imprecise rewards, often used in situations where the data are uncertain. In this context, we provide algorithms for finding the policy that minimizes the maximum regret. To the best of our knowledge, all the regret-based methods proposed in the literature focus on providing an optimal stochastic policy. We introduce for the first time a method to calculate an optimal deterministic policy using optimization approaches. Deterministic policies are easily interpretable for users because for a given state they provide a unique choice. To better motivate the use of an exact procedure for finding a deterministic policy, we show some (theoretical and experimental) cases where the intuitive idea of using a deterministic policy obtained after “determinizing” the optimal stochastic policy leads to a policy far from the exact deterministic policy.
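To make the minimax-regret objective concrete: the regret of a fixed policy under a candidate reward function is the gap between its value and the best value achievable under that reward, and the goal is a policy whose worst-case gap is smallest. The brute-force sketch below enumerates the deterministic policies of a tiny imprecise-reward MDP and picks the one with minimum maximum regret; it only illustrates the objective, not the optimization procedure of the paper, and all data are hypothetical.

```python
# Enumerative max-regret computation for a tiny imprecise-reward MDP.
# The paper uses optimization methods; this brute-force sketch only
# illustrates the minimax-regret objective.  All data are hypothetical.
import itertools

gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]
P = {"a0": {"s0": {"s0": 1.0}, "s1": {"s0": 1.0}},
     "a1": {"s0": {"s1": 1.0}, "s1": {"s1": 1.0}}}

# Finite set of candidate reward functions capturing the imprecision.
candidate_rewards = [
    {"s0": {"a0": 1.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 2.0}},
    {"s0": {"a0": 0.0, "a1": 1.5}, "s1": {"a0": 2.0, "a1": 0.0}},
]

def q_values(V, R, s):
    return {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[a][s].items())
            for a in actions}

def optimal_value(R, iters=500):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(q_values(V, R, s).values()) for s in states}
    return V

def policy_value(pi, R, iters=500):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: q_values(V, R, s)[pi[s]] for s in states}
    return V

def max_regret(pi):
    return max(max(optimal_value(R)[s] - policy_value(pi, R)[s] for s in states)
               for R in candidate_rewards)

deterministic_policies = [dict(zip(states, choice))
                          for choice in itertools.product(actions, repeat=len(states))]
best = min(deterministic_policies, key=max_regret)
print(best, max_regret(best))
```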


2012 ◽  
Vol 580 ◽  
pp. 175-179 ◽  
Author(s):  
Hong Fu Liu ◽  
Yu Zhang ◽  
Shao Fei Chen ◽  
Jing Chen

We propose a framework based on stochastic collocation to solve autonomous vehicle optimal trajectory planning problems with probabilistic uncertainty. We model uncertainty in the location and size of obstacles. We develop stochastic pseudospectral methods to minimize the expected cost of the governing differential equations while meeting path, control, and boundary constraints. Results are shown on two examples of autonomous vehicle trajectory planning under uncertainty, which illustrate the feasibility and applicability of our method.
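The stochastic-collocation idea can be sketched in a few lines: evaluate a deterministic cost at a small set of quadrature nodes for the uncertain parameter and combine the results with the quadrature weights to approximate the expected cost. The example below does this for a trajectory threatened by an obstacle with a Gaussian-uncertain position; the trajectory, obstacle model, and cost function are hypothetical and only illustrate the mechanism, not the pseudospectral formulation of the paper.

```python
# Stochastic-collocation sketch: approximate the expected cost of a candidate
# trajectory under a Gaussian-uncertain obstacle position using Gauss-Hermite
# quadrature nodes.  The trajectory, obstacle model, and cost are hypothetical.
import numpy as np

# Candidate trajectory: straight line sampled at a few waypoints.
waypoints = np.linspace([0.0, 0.0], [10.0, 0.0], 50)

# Obstacle x-position is uncertain: x_obs ~ N(mu, sigma^2); y fixed at 0.5.
mu, sigma = 5.0, 0.4
nodes, weights = np.polynomial.hermite_e.hermegauss(7)   # probabilists' Hermite
x_obs_nodes = mu + sigma * nodes
weights = weights / weights.sum()                         # normalize to sum to 1

def cost(x_obs):
    """Path length plus a penalty for passing close to the obstacle."""
    obstacle = np.array([x_obs, 0.5])
    d = np.linalg.norm(waypoints - obstacle, axis=1).min()
    return 10.0 + 100.0 * max(0.0, 1.0 - d)               # length + clearance penalty

expected_cost = sum(w * cost(x) for w, x in zip(weights, x_obs_nodes))
print("expected cost ~", expected_cost)
```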


2018 ◽  
Vol 37 (13-14) ◽  
pp. 1632-1672 ◽  
Author(s):  
Sanjiban Choudhury ◽  
Mohak Bhardwaj ◽  
Sankalp Arora ◽  
Ashish Kapoor ◽  
Gireeja Ranade ◽  
...  

Robot planning is the process of selecting a sequence of actions that optimize for a task-specific objective. For instance, the objective for a navigation task would be to find collision-free paths, whereas the objective for an exploration task would be to map unknown areas. The optimal solutions to such tasks are heavily influenced by the implicit structure in the environment, i.e. the configuration of objects in the world. State-of-the-art planning approaches, however, do not exploit this structure, thereby expending valuable effort searching the action space instead of focusing on potentially good actions. In this paper, we address the problem of enabling planners to adapt their search strategies by inferring such good actions in an efficient manner using only the information uncovered by the search up until that time. We formulate this as a problem of sequential decision making under uncertainty where at a given iteration a planning policy must map the state of the search to a planning action. Unfortunately, the training process for such partial-information-based policies is slow to converge and susceptible to poor local minima. Our key insight is that if we could fully observe the underlying world map, we would easily be able to disambiguate between good and bad actions. We hence present a novel data-driven imitation learning framework to efficiently train planning policies by imitating a clairvoyant oracle: an oracle that at train time has full knowledge about the world map and can compute optimal decisions. We leverage the fact that for planning problems, such oracles can be efficiently computed and derive performance guarantees for the learnt policy. We examine two important domains that rely on partial-information-based policies: informative path planning and search-based motion planning. We validate the approach on a spectrum of environments for both problem domains, including experiments on a real UAV, and show that the learnt policy consistently outperforms state-of-the-art algorithms. Our framework is able to train policies that achieve up to [Formula: see text] more reward than state-of-the-art information-gathering heuristics and a [Formula: see text] speedup as compared with A* on search-based planning problems. Our approach paves the way forward for applying data-driven techniques to other such problem domains under the umbrella of robot planning.
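The training loop can be pictured schematically: roll out a mixture of the current partial-information policy and the oracle, have the clairvoyant oracle (which sees the full world map) label every visited search state with its preferred action, and fit the policy to the aggregated labels, in the spirit of DAgger-style imitation. The sketch below uses hypothetical interfaces (sample_world, oracle.best_action, policy.fit, and the search-state methods) and is not the paper's implementation.

```python
# Schematic, DAgger-style training loop for imitating a clairvoyant oracle.
# The environments, the oracle, and the policy/search interfaces below are
# hypothetical placeholders, not the paper's implementation.
import random

def train_planning_policy(policy, oracle, sample_world, n_iters=10, episodes=20):
    dataset = []                                   # (search_state_features, oracle_action)
    for _ in range(n_iters):
        for _ in range(episodes):
            world = sample_world()                 # full map, hidden from the policy
            search = world.new_search()            # partially informed search state
            while not search.done():
                feats = search.features()          # info uncovered by the search so far
                # The oracle sees the whole world, so it can label the best next action.
                dataset.append((feats, oracle.best_action(world, search)))
                # Mix oracle and learner actions to cover the learner's state distribution.
                act = oracle.best_action(world, search) if random.random() < 0.5 \
                      else policy.predict(feats)
                search.expand(act)
        policy.fit(dataset)                        # supervised learning on aggregated data
    return policy
```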


2018 ◽  
Vol 24 (3) ◽  
pp. 1043-1058
Author(s):  
Nikolai Dokuchaev

The paper studies the problem of optimal portfolio selection. It is shown that, under some mild conditions, near-optimal strategies for investors with different performance criteria can be constructed using a limited number of fixed processes (mutual funds), for a market with a larger number of available risky stocks. This implies dimension reduction for the optimal portfolio selection problem: all rational investors may achieve optimality using the same mutual funds plus a savings account. This result is obtained under mild restrictions on the utility functions without any assumptions on the regularity of the value function. The proof is based on the method of dynamic programming applied indirectly to some convenient approximations of the original problem that ensure certain regularity of the value functions. To overcome technical difficulties, we use special time-dependent and random constraints for admissible strategies such that the corresponding HJB (Hamilton–Jacobi–Bellman) equation admits “almost explicit” solutions generating near-optimal admissible strategies featuring sufficient regularity and integrability.
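For orientation, the HJB equation mentioned here is of the standard stochastic-control form; the display below is the generic statement for a controlled diffusion with terminal utility and is not the paper's specific constrained formulation.

```latex
% Generic HJB equation for the value function V(t,x) of a controlled diffusion
% dX_t = b(X_t,u_t)\,dt + \sigma(X_t,u_t)\,dW_t with terminal utility \Phi;
% not the paper's specific constrained formulation.
\[
  \partial_t V(t,x)
  + \sup_{u \in U} \Big\{ b(x,u)\,\partial_x V(t,x)
  + \tfrac{1}{2}\,\sigma^2(x,u)\,\partial_{xx} V(t,x) \Big\} = 0,
  \qquad V(T,x) = \Phi(x).
\]
```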


Author(s):  
Carlos Diuk ◽  
Michael Littman

Reinforcement learning (RL) deals with the problem of an agent that has to learn how to behave to maximize its utility through its interactions with an environment (Sutton & Barto, 1998; Kaelbling, Littman & Moore, 1996). Reinforcement learning problems are usually formalized as Markov Decision Processes (MDPs), which consist of a finite set of states and a finite number of possible actions that the agent can perform. At any given point in time, the agent is in a certain state and picks an action. It can then observe the new state this action leads to, and receives a reward signal. The goal of the agent is to maximize its long-term reward. In this standard formalization, no particular structure or relationship between states is assumed. However, learning in environments with extremely large state spaces is infeasible without some form of generalization. Exploiting the underlying structure of a problem can enable generalization and has long been recognized as an important aspect of representing sequential decision tasks (Boutilier et al., 1999). Hierarchical reinforcement learning is the subfield of RL that deals with the discovery and/or exploitation of this underlying structure. Two main ideas come into play in hierarchical RL. The first is to break a task into a hierarchy of smaller subtasks, each of which can be learned faster and more easily than the whole problem. Subtasks can also be performed multiple times in the course of achieving the larger task, reusing accumulated knowledge and skills. The second idea is to use state abstraction within subtasks: not every subtask needs to be concerned with every aspect of the state space, so some states can be abstracted away and treated as the same for the purpose of the given subtask.
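A toy illustration of the two ideas, in the spirit of this description rather than any specific hierarchical-RL algorithm: a keyed-door gridworld broken into two subtasks, each with its own termination condition and its own state abstraction. The domain and all names are hypothetical.

```python
# Toy illustration of the two hierarchical-RL ideas above: subtasks that can be
# learned and reused, and per-subtask state abstraction.  The keyed-door
# gridworld and all names are hypothetical.

# Full state: (x, y, has_key, door_open).  Each subtask sees only what it needs.
def abstract_for_fetch_key(state):
    x, y, has_key, door_open = state
    return (x, y)                       # key possession is what this subtask achieves

def abstract_for_open_door(state):
    x, y, has_key, door_open = state
    return (x, y, has_key)              # this subtask must know whether the key is held

class Subtask:
    def __init__(self, name, abstraction, is_done):
        self.name = name
        self.abstraction = abstraction  # state-abstraction function
        self.is_done = is_done          # termination condition on the full state
        self.q = {}                     # Q-table over *abstract* states

    def choose(self, state, actions):
        s = self.abstraction(state)
        return max(actions, key=lambda a: self.q.get((s, a), 0.0))

# The overall task is a fixed sequence of subtasks; a root policy could also
# learn which subtask to invoke.
fetch_key = Subtask("fetch_key", abstract_for_fetch_key, lambda s: s[2])
open_door = Subtask("open_door", abstract_for_open_door, lambda s: s[3])
hierarchy = [fetch_key, open_door]
```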


2019 ◽  
Vol 11 (1) ◽  
pp. 833-858 ◽  
Author(s):  
John Rust

Dynamic programming (DP) is a powerful tool for solving a wide class of sequential decision-making problems under uncertainty. In principle, it enables us to compute optimal decision rules that specify the best possible decision in any situation. This article reviews developments in DP and contrasts its revolutionary impact on economics, operations research, engineering, and artificial intelligence with the comparative paucity of its real-world applications to improve the decision making of individuals and firms. The fuzziness of many real-world decision problems and the difficulty in mathematically modeling them are key obstacles to a wider application of DP in real-world settings. Nevertheless, I discuss several success stories, and I conclude that DP offers substantial promise for improving decision making if we let go of the empirically untenable assumption of unbounded rationality and confront the challenging decision problems faced every day by individuals and firms.

