An Algorithm for Making Regime-Changing Markov Decisions

In industrial applications, optimal sequential decision making is naturally formulated and optimized within the standard setting of Markov decision theory. In practice, however, decisions must be made under incomplete and uncertain information about parameters and transition probabilities. This situation occurs when a system may suffer a regime switch that changes not only the transition probabilities but also the control costs. After such an event, the effect of the actions may be reversed, meaning that all strategies must be revised. Due to the practical importance of this problem, a variety of methods have been suggested, ranging from incorporating regime switches into Markov dynamics to numerous concepts addressing model uncertainty. In this work, we suggest a pragmatic and practical approach: by naturally reformulating this problem as a so-called convex switching system, we make efficient numerical algorithms applicable.


Optimal Policies for Quantum Markov Decision Processes

International Journal of Automation and Computing ◽

10.1007/s11633-021-1278-z ◽

2021 ◽

Author(s):

Ming-Sheng Ying ◽

Yuan Feng ◽

Sheng-Gang Ying

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Quantum Systems ◽

Sequential Decision Making ◽

Mathematical Framework ◽

Sequential Decision ◽

Learning Techniques ◽

Optimal Policies ◽

Markov Decision ◽

Programming Algorithms

The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of the MDP, namely the quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
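For orientation, finite-horizon dynamic programming can be sketched for a classical (not quantum) MDP as backward induction. This is only an illustration of the general technique, not the qMDP algorithm of the paper; the function name `backward_induction`, the list-based layout of `P` and `R`, and the two-state example are all assumptions made here.

```python
def backward_induction(P, R, horizon):
    """Finite-horizon dynamic programming for a classical MDP (illustrative sketch).

    P[a][s][t]: probability of moving from state s to state t under action a.
    R[a][s]:    immediate reward for taking action a in state s.
    Returns (values at the first step, policy[step][state]).
    """
    n_actions, n_states = len(P), len(P[0])
    V = [0.0] * n_states  # value after the final step is zero
    policy = []
    for _ in range(horizon):
        # Q[s][a]: reward now plus expected value of the remaining steps
        Q = [[R[a][s] + sum(P[a][s][t] * V[t] for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        step_policy = [max(range(n_actions), key=lambda a: Q[s][a])
                       for s in range(n_states)]
        V = [Q[s][step_policy[s]] for s in range(n_states)]
        policy.insert(0, step_policy)  # we iterate backwards in time
    return V, policy


# Hypothetical example: action 0 stays put, action 1 switches state.
# Staying in state 1 earns reward 1; everything else earns 0.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
R = [
    [0.0, 1.0],  # action 0
    [0.0, 0.0],  # action 1
]
values, policy = backward_induction(P, R, horizon=3)
```

In this toy problem the policy depends on the time step: from state 0 it pays to switch early, but at the last step switching no longer helps, which is the characteristic feature of finite-horizon problems.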


Deterministic policies based on maximum regrets in MDPs with imprecise rewards

AI Communications ◽

10.3233/aic-190632 ◽

2021 ◽

pp. 1-16

Author(s):

Pegah Alizadeh ◽

Emiliano Traversi ◽

Aomar Osmani

Keyword(s):

Decision Making ◽

Decision Process ◽

Process Models ◽

Sequential Decision Making ◽

Sequential Decision ◽

Exact Procedure ◽

Markov Decision ◽

Intuitive Idea ◽

First Time ◽

Maximum Regret

Markov Decision Process Models (MDPs) are a powerful tool for planning tasks and sequential decision-making issues. In this work we deal with MDPs with imprecise rewards, often used when dealing with situations where the data is uncertain. In this context, we provide algorithms for finding the policy that minimizes the maximum regret. To the best of our knowledge, all the regret-based methods proposed in the literature focus on providing an optimal stochastic policy. We introduce for the first time a method to calculate an optimal deterministic policy using optimization approaches. Deterministic policies are easily interpretable for users because for a given state they provide a unique choice. To better motivate the use of an exact procedure for finding a deterministic policy, we show some (theoretical and experimental) cases where the intuitive idea of using a deterministic policy obtained after “determinizing” the optimal stochastic policy leads to a policy far from the exact deterministic policy.


A Novel Heterogeneous Swarm Reinforcement Learning Method for Sequential Decision Making Problems

Machine Learning and Knowledge Extraction ◽

10.3390/make1020035 ◽

2019 ◽

Vol 1 (2) ◽

pp. 590-610

Author(s):

Zohreh Akbari ◽

Rainer Unland

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Single Agent ◽

Sequential Decision Making ◽

Multi Agent Systems ◽

Sequential Decision ◽

Agent Systems ◽

Novel Approach ◽

Markov Decision ◽

Multi Agent

Sequential Decision Making Problems (SDMPs) that can be modeled as Markov Decision Processes can be solved using methods that combine Dynamic Programming (DP) and Reinforcement Learning (RL). Depending on the problem scenarios and the available Decision Makers (DMs), such RL algorithms may be designed for single-agent systems or multi-agent systems that either consist of agents with individual goals and decision making capabilities, which are influenced by other agents’ decisions, or behave as a swarm of agents that collaboratively learn a single objective. Many studies have been conducted in this area; however, when concentrating on available swarm RL algorithms, one obtains a clear view of the areas that still require attention. Most of the studies in this area focus on homogeneous swarms and so far, systems introduced as Heterogeneous Swarms (HetSs) merely include very few, i.e., two or three sub-swarms of homogeneous agents, which either, according to their capabilities, deal with a specific sub-problem of the general problem or exhibit different behaviors in order to reduce the risk of bias. This study introduces a novel approach that allows agents, which are originally designed to solve different problems and hence have higher degrees of heterogeneity, to behave as a swarm when addressing identical sub-problems. In fact, the affinity between two agents, which measures the compatibility of agents to work together towards solving a specific sub-problem, is used in designing a Heterogeneous Swarm RL (HetSRL) algorithm that allows HetSs to solve the intended SDMPs.


Online Planning Algorithms for POMDPs

Journal of Artificial Intelligence Research ◽

10.1613/jair.2567 ◽

2008 ◽

Vol 32 ◽

pp. 663-704 ◽

Cited By ~ 165

Author(s):

S. Ross ◽

J. Pineau ◽

S. Paquet ◽

B. Chaib-draa

Keyword(s):

Heuristic Search ◽

State Of The Art ◽

Sequential Decision Making ◽

Sequential Decision ◽

Time Step ◽

Advantages And Disadvantages ◽

Online Planning ◽

Markov Decision ◽

Heuristic Search Methods ◽

Partially Observable

Partially Observable Markov Decision Processes (POMDPs) provide a rich framework for sequential decision-making under uncertainty in stochastic domains. However, solving a POMDP is often intractable except for small problems due to their complexity. Here, we focus on online approaches that alleviate the computational complexity by computing good local policies at each decision step during the execution. Online algorithms generally consist of a lookahead search to find the best action to execute at each time step in an environment. Our objectives here are to survey the various existing online POMDP methods, analyze their properties and discuss their advantages and disadvantages; and to thoroughly evaluate these online approaches in different environments under various metrics (return, error bound reduction, lower bound improvement). Our experimental results indicate that state-of-the-art online heuristic search methods can handle large POMDP domains efficiently.


Teaching Machines to Extract Main Content for Machine Reading Comprehension

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019973 ◽

2019 ◽

Vol 33 ◽

pp. 9973-9974

Author(s):

Zhaohui Li ◽

Yue Feng ◽

Jun Xu ◽

Jiafeng Guo ◽

Yanyan Lan ◽

...

Keyword(s):

Reading Comprehension ◽

Sequential Decision Making ◽

Model Parameters ◽

Sequential Decision ◽

Proposed Model ◽

Policy Gradient ◽

Markov Decision ◽

Content Identification ◽

Machine Reading ◽

Teaching Machines

Machine reading comprehension, whose goal is to find answers from the candidate passages for a given question, has attracted considerable research effort in recent years. One of the key challenges in machine reading comprehension is how to identify the main content from a large, redundant, and overlapping set of candidate sentences. In this paper we propose to tackle the challenge with a Markov decision process in which main content identification is formalized as sequential decision making and each action corresponds to selecting a sentence. Policy gradient is used to learn the model parameters. Experimental results based on MSMARCO show that the proposed model, called MC-MDP, can select high-quality main content and significantly improve the performance of answer span prediction.


Hidden-Mode Markov Decision Processes for Nonstationary Sequential Decision Making

Sequence Learning - Lecture Notes in Computer Science ◽

10.1007/3-540-44565-x_12 ◽

2000 ◽

pp. 264-287 ◽

Cited By ~ 7

Author(s):

Samuel P. M. Choi ◽

Dit-Yan Yeung ◽

Nevin L. Zhang

Keyword(s):

Decision Making ◽

Markov Decision Processes ◽

Decision Processes ◽

Sequential Decision Making ◽

Sequential Decision ◽

Markov Decision


Markov Decision Processes: A Tool for Sequential Decision Making under Uncertainty

Medical Decision Making ◽

10.1177/0272989x09353194 ◽

2009 ◽

Vol 30 (4) ◽

pp. 474-483 ◽

Cited By ~ 100

Author(s):

Oguzhan Alagoz ◽

Heather Hsu ◽

Andrew J. Schaefer ◽

Mark S. Roberts

Keyword(s):

Decision Making ◽

Markov Decision Processes ◽

Living Donor ◽

Decision Processes ◽

Medical Decision ◽

Sequential Decision Making ◽

Optimal Timing ◽

Decision Making Under Uncertainty ◽

Sequential Decision ◽

Markov Decision

We provide a tutorial on the construction and evaluation of Markov decision processes (MDPs), which are powerful analytical tools used for sequential decision making under uncertainty that have been widely used in many industrial and manufacturing applications but are underutilized in medical decision making (MDM). We demonstrate the use of an MDP to solve a sequential clinical treatment problem under uncertainty. Markov decision processes generalize standard Markov models in that a decision process is embedded in the model and multiple decisions are made over time. Furthermore, they have significant advantages over standard decision analysis. We compare MDPs to standard Markov-based simulation models by solving the problem of the optimal timing of living-donor liver transplantation using both methods. Both models result in the same optimal transplantation policy and the same total life expectancies for the same patient and living donor. The computation time for solving the MDP model is significantly smaller than that for solving the Markov model. We briefly describe the growing literature of MDPs applied to medical decisions.


Risk-Sensitive Reinforcement Learning

Neural Computation ◽

10.1162/neco_a_00600 ◽

2014 ◽

Vol 26 (7) ◽

pp. 1298-1328 ◽

Cited By ~ 19

Author(s):

Yun Shen ◽

Michael J. Tobia ◽

Tobias Sommer ◽

Klaus Obermayer

Keyword(s):

Reinforcement Learning ◽

Prospect Theory ◽

Human Behavior ◽

Transition Probabilities ◽

Learning Algorithm ◽

Sequential Decision ◽

Uncertain Environments ◽

Risk Sensitive ◽

Markov Decision ◽

Q Values

We derive a family of risk-sensitive reinforcement learning methods for agents who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents’ behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subjects' responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition, we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
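The core idea of applying a utility function to the TD error can be sketched as a small change to the usual tabular Q-learning update. The sketch below is an assumption-laden illustration, not the authors' exact algorithm: the piecewise-linear `utility`, its slopes, the function name `risk_sensitive_q_update`, and the tiny Q-table are all hypothetical, and the paper's utility functions and update rule may differ in detail.

```python
def utility(td_error, gain_slope=0.5, loss_slope=1.0):
    """Hypothetical piecewise-linear utility: negative TD errors (losses)
    are weighted more heavily than positive ones (gains)."""
    if td_error >= 0:
        return gain_slope * td_error
    return loss_slope * td_error


def risk_sensitive_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step with the utility applied to the TD error.

    Standard Q-learning would add alpha * td_error; here we add
    alpha * utility(td_error) instead. Returns the raw TD error.
    """
    td_error = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * utility(td_error)
    return td_error


# Two states, two actions, all Q-values initially zero (illustrative only).
Q = [[0.0, 0.0], [0.0, 0.0]]
gain_td = risk_sensitive_q_update(Q, s=0, a=0, r=1.0, s_next=1)   # a gain of +1
loss_td = risk_sensitive_q_update(Q, s=0, a=1, r=-1.0, s_next=1)  # a loss of -1
```

With these assumed slopes, a gain and a loss of equal magnitude move the Q-values by different amounts, which is the asymmetry between gains and losses that the abstract relates to prospect theory.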


An Introduction to Fully and Partially Observable Markov Decision Processes

Decision Theory Models for Applications in Artificial Intelligence ◽

10.4018/978-1-60960-165-2.ch003 ◽

2012 ◽

pp. 33-62 ◽

Cited By ~ 2

Author(s):

Pascal Poupart

Keyword(s):

Decision Making ◽

Markov Decision Processes ◽

Decision Processes ◽

Sequential Decision Making ◽

Decision Making Under Uncertainty ◽

Sequential Decision ◽

Markov Decision ◽

The Common ◽

Partially Observable Markov ◽

Partially Observable

The goal of this chapter is to provide an introduction to Markov decision processes as a framework for sequential decision making under uncertainty. The aim of this introduction is to provide practitioners with a basic understanding of the common modeling and solution techniques. Hence, we will not delve into the details of the most recent algorithms, but rather focus on the main concepts and the issues that impact deployment in practice. More precisely, we will review fully and partially observable Markov decision processes, describe basic algorithms to find good policies and discuss modeling/computational issues that arise in practice.


Risk Sensitive Probabilistic Planning with ILAO* and Exponential Utility Function

10.5753/eniac.2018.4434 ◽

2018 ◽

Author(s):

Elthon Manhas De Freitas ◽

Karina Valdivia Delgado ◽

Valdinei Freire

Keyword(s):

Heuristic Search ◽

Search Algorithm ◽

Decision Processes ◽

Sequential Decision Making ◽

Sequential Decision ◽

Initial State ◽

Heuristic Search Algorithm ◽

Risk Sensitive ◽

Markov Decision ◽

Exponential Utility Function

Markov decision processes (MDPs) have been used very efficiently to solve sequential decision-making problems. However, there are problems in which dealing with the risks of the environment to obtain a reliable result is more important than minimizing the total expected cost. MDPs that deal with this type of problem are called risk-sensitive Markov decision processes (RSMDPs). In this paper we propose an efficient heuristic search algorithm that obtains a solution by evaluating only the states relevant to reaching the goal states from an initial state.
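To see why an exponential utility criterion can rank plans differently from total expected cost, a small worked comparison may help. The sketch below does not implement ILAO* or the authors' algorithm; it only evaluates two hypothetical cost distributions under both criteria. The names `expected_cost` and `certainty_equivalent`, the risk parameter `lam = 0.1`, and the two plans are assumptions chosen for illustration.

```python
import math


def expected_cost(outcomes):
    """Risk-neutral criterion: sum of probability * cost."""
    return sum(p * c for p, c in outcomes)


def certainty_equivalent(outcomes, lam):
    """Exponential-utility criterion for costs with risk parameter lam > 0:
    (1 / lam) * log E[exp(lam * cost)]. Larger lam penalizes rare, large costs more."""
    return math.log(sum(p * math.exp(lam * c) for p, c in outcomes)) / lam


# Each plan is a list of (probability, total cost) pairs; values are made up.
safe_plan = [(1.0, 10.0)]                # always costs 10
risky_plan = [(0.9, 0.0), (0.1, 90.0)]   # usually free, occasionally very costly

lam = 0.1
neutral_prefers_risky = expected_cost(risky_plan) < expected_cost(safe_plan)
averse_prefers_safe = (certainty_equivalent(safe_plan, lam)
                       < certainty_equivalent(risky_plan, lam))
```

Under these assumed numbers the risky plan has the lower expected cost, yet the exponential criterion prefers the safe plan, which is the kind of trade-off a risk-sensitive MDP is meant to capture.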
