The variance of discounted Markov decision processes

1982 ◽  
Vol 19 (4) ◽  
pp. 794-802 ◽  
Author(s):  
Matthew J. Sobel

Formulae are presented for the variance and higher moments of the present value of single-stage rewards in a finite Markov decision process. Similar formulae are exhibited for a semi-Markov decision process. There is a short discussion of the obstacles to using the variance formula in algorithms to maximize the mean minus a multiple of the standard deviation.
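The abstract does not reproduce the formulae, but the fixed-policy specialisation is easy to sketch: with state rewards r, row-stochastic transition matrix P, and discount factor β, the mean return solves m = r + βPm and the second moment solves s = u + β²Ps with u_i = r_i² + 2βr_i(Pm)_i. The code below is a minimal numerical illustration under those assumptions, not the paper's general notation.

```python
import numpy as np

def discounted_return_moments(P, r, beta):
    """Mean and variance of the infinite-horizon discounted return under a
    fixed policy: P is the transition matrix, r the per-state reward, beta
    the discount factor (a fixed-policy specialisation of the formulae)."""
    n = len(r)
    I = np.eye(n)
    m = np.linalg.solve(I - beta * P, r)        # mean: m = r + beta P m
    u = r**2 + 2 * beta * r * (P @ m)           # cross terms of (r_i + beta X_J)^2
    s = np.linalg.solve(I - beta**2 * P, u)     # second moment: s = u + beta^2 P s
    return m, s - m**2                          # variance = E[X^2] - (E[X])^2

# Sanity check: a single absorbing state with constant reward gives the
# deterministic return r/(1-beta), so the variance must be zero.
P = np.array([[1.0]])
r = np.array([2.0])
m, v = discounted_return_moments(P, r, 0.9)
```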



1998 ◽  
Vol 12 (2) ◽  
pp. 177-187 ◽  
Author(s):  
Kazuyoshi Wakuta

We consider a discounted cost Markov decision process with a constraint. Relating this to a vector-valued Markov decision process, we prove that there exists a constrained optimal randomized semistationary policy if there exists at least one policy satisfying a constraint. Moreover, we present an algorithm by which we can find the constrained optimal randomized semistationary policy, or we can discover that there exist no policies satisfying a given constraint.
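The paper's algorithm works through a vector-valued process, but the same constrained discounted problem is often posed as a linear program over discounted occupation measures; the sketch below uses that standard formulation on a hypothetical 2-state, 2-action example (all numbers invented for illustration).

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical example: minimise expected discounted cost subject to a
# bound on an auxiliary discounted cost, via the occupation-measure LP.
nS, nA, beta = 2, 2, 0.9
P = np.zeros((nS, nA, nS))
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.0, 1.0]   # action 0 -> state 0
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.0, 1.0]   # action 1 -> state 1
c = np.array([[1.0, 0.0], [1.0, 0.0]])       # main cost: action 1 is free
d = np.array([[0.0, 1.0], [0.0, 1.0]])       # constraint cost: action 1 is risky
D = 2.0                                      # constraint budget
alpha = np.array([1.0, 0.0])                 # initial distribution

# Variables x[s, a] >= 0 (discounted occupation measures), flattened row-major.
# Flow constraints: sum_a x(s',a) - beta * sum_{s,a} P(s'|s,a) x(s,a) = alpha(s').
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (sp == s) - beta * P[s, a, sp]

res = linprog(c.ravel(), A_ub=[d.ravel()], b_ub=[D],
              A_eq=A_eq, b_eq=alpha, bounds=(0, None))
```

Total occupation mass is 1/(1-β) = 10; the budget allows at most mass 2 on the free-but-risky action, so the optimal value here is 8. An optimal solution of such an LP generally requires randomization, which is the phenomenon the abstract's semistationary policies address.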


2021 ◽  
Author(s):  
Xiaocheng Li ◽  
Huaiyang Zhong ◽  
Margaret L. Brandeau

Title: Sequential Decision Making Using Quantiles

The goal of a traditional Markov decision process (MDP) is to maximize the expectation of cumulative reward over a finite or infinite horizon. In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward. For example, a physician may want to determine the optimal drug regimen for a risk-averse patient with the objective of maximizing the 0.10 quantile of the cumulative reward; this is the cumulative improvement in health that is expected to occur with at least 90% probability for the patient. In “Quantile Markov Decision Processes,” X. Li, H. Zhong, and M. Brandeau provide analytic results to solve the quantile Markov decision process (QMDP) problem. They develop an efficient dynamic programming procedure that finds the optimal QMDP value function for all states and quantiles in one pass. The algorithm also extends to the MDP problem with a conditional value-at-risk objective.
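The paper computes quantile value functions exactly by dynamic programming; as a hedged illustration of the objective itself (not the authors' algorithm), the sketch below estimates the 0.10 quantile of the cumulative reward of a fixed policy in a toy finite-horizon chain by Monte Carlo simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cumulative_rewards(n_paths, horizon, p_success, reward):
    # Toy model: each of `horizon` steps independently succeeds with
    # probability p_success and pays `reward`; all parameters invented.
    steps = rng.random((n_paths, horizon)) < p_success
    return (steps * reward).sum(axis=1)

returns = cumulative_rewards(n_paths=100_000, horizon=20,
                             p_success=0.5, reward=1.0)
q10 = np.quantile(returns, 0.10)   # reward achieved with >= 90% probability
```

For this Binomial(20, 0.5) example the 0.10 quantile is 7: the patient-style guarantee is "at least 7 units of reward with 90% probability", which is the quantity a QMDP optimizes directly.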


1983 ◽  
Vol 20 (2) ◽  
pp. 368-379
Author(s):  
Lam Yeh ◽  
L. C. Thomas

By considering continuous-time Markov decision processes where decisions can be made at any time, we show in the case of M/M/1 queues with discounted costs that there exists a monotone optimal policy among all the regular policies.
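The paper's argument is genuinely continuous-time; as a discrete-time illustration of why monotone policies are natural in this setting, the sketch below runs value iteration on a uniformized M/M/1 admission-control model (all rates, rewards, and the buffer size are hypothetical) and recovers a threshold policy: admit below a cutoff queue length, reject above it.

```python
import numpy as np

lam, mu, alpha = 1.0, 1.2, 0.1      # arrival, service, and discount rates
Lam = lam + mu                      # uniformization constant
gamma = Lam / (Lam + alpha)         # induced discrete-time discount factor
R, h, N = 10.0, 1.0, 30             # admission reward, holding cost, buffer size
p, q = lam / Lam, mu / Lam          # uniformized arrival / departure probabilities

V = np.zeros(N + 1)
for _ in range(2000):
    Vn = np.empty_like(V)
    for x in range(N + 1):
        if x < N:
            arrive = max(R + V[x + 1], V[x])   # admit vs reject an arrival
        else:
            arrive = V[x]                      # buffer full: arrival is lost
        depart = V[x - 1] if x > 0 else V[0]   # a service completion (if any)
        Vn[x] = -h * x + gamma * (p * arrive + q * depart)
    V = Vn

# Admit at state x iff the reward beats the marginal congestion cost.
policy = [1 if R + V[x + 1] > V[x] else 0 for x in range(N)]
```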


2020 ◽  
Vol 40 (1) ◽  
pp. 117-137
Author(s):  
R. Israel Ortega-Gutiérrez ◽  
H. Cruz-Suárez

This paper addresses a class of sequential optimization problems known as Markov decision processes. These processes are considered on Euclidean state and action spaces with the total expected discounted cost as the objective function. The main goal of the paper is to provide conditions that guarantee an adequate Moreau-Yosida regularization for Markov decision processes (named the original process). In this way, a new Markov decision process is established that conforms to the Markov control model of the original process except for the cost function, which is induced via the Moreau-Yosida regularization. Compared to the original process, this new discounted Markov decision process has richer properties: differentiability of its optimal value function, strict convexity of the value function, and uniqueness of the optimal policy; moreover, the optimal value function and the optimal policy of both processes coincide. To complement the theory presented, an example is provided.
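The Moreau-Yosida regularization replaces a cost f by its envelope f_λ(x) = min_y { f(y) + ‖x-y‖²/(2λ) }, which is differentiable even when f is not. The brute-force sketch below (an illustration only, not the paper's construction) checks this against the one case with a well-known closed form: for f(y) = |y| the envelope is the Huber function, x²/(2λ) for |x| ≤ λ and |x| - λ/2 otherwise.

```python
import numpy as np

def moreau_envelope(f, x, lam, grid):
    """Numerical Moreau-Yosida envelope
    f_lam(x) = min_y f(y) + (x - y)^2 / (2 * lam),
    minimised by brute force over a dense grid of candidate y values."""
    return np.min(f(grid) + (x - grid) ** 2 / (2 * lam))

lam = 1.0
grid = np.linspace(-5.0, 5.0, 200001)   # dense enough for ~1e-4 accuracy

env_half = moreau_envelope(np.abs, 0.5, lam, grid)   # |x| <= lam regime
env_three = moreau_envelope(np.abs, 3.0, lam, grid)  # |x| > lam regime
```

The envelope is smooth at the kink of |y|, which mirrors the differentiability the regularized decision process gains over the original one.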


1987 ◽  
Vol 24 (3) ◽  
pp. 644-656 ◽  
Author(s):  
Frederick J. Beutler ◽  
Keith W. Ross

Uniformization permits the replacement of a semi-Markov decision process (SMDP) by a Markov chain exhibiting the same average rewards for simple (non-randomized) policies. It is shown that various anomalies may occur, especially for stationary (randomized) policies; uniformization introduces virtual jumps with concomitant action changes not present in the original process. Since these lead to discrepancies in the average rewards for stationary processes, uniformization can be accepted as valid only for simple policies. We generalize uniformization to yield consistent results for stationary policies also. These results are applied to constrained optimization of SMDP, in which stationary (randomized) policies appear naturally. The structure of optimal constrained SMDP policies can then be elucidated by studying the corresponding controlled Markov chains. Moreover, constrained SMDP optimal policy computations can be more easily implemented in discrete time, the generalized uniformization being employed to relate discrete- and continuous-time optimal constrained policies.
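For a fixed (simple) policy, where the abstract notes uniformization is unproblematic, the construction replaces a generator Q by the discrete-time chain P = I + Q/Λ with Λ ≥ max_i |q_ii|; the stationary distribution is unchanged because πQ = 0 iff πP = π. A minimal numerical check on an invented two-state generator:

```python
import numpy as np

# Uniformize a two-state CTMC generator (example numbers are invented).
Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])
Lam = 2.0                        # any Lam >= max_i |q_ii| works
P = np.eye(2) + Q / Lam          # uniformized DTMC transition matrix

# Stationary distribution of P: solve pi (P - I) = 0 with sum(pi) = 1.
A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
```

Here π = (1/3, 2/3) also satisfies πQ = 0, illustrating the consistency for simple policies; the abstract's point is that this equivalence breaks for stationary randomized policies unless the uniformization is generalized.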


2017 ◽  
Vol 54 (4) ◽  
pp. 1071-1088
Author(s):  
Xin Guo ◽  
Alexey Piunovskiy ◽  
Yi Zhang

Abstract We consider the discounted continuous-time Markov decision process (CTMDP), where the negative part of each cost rate is bounded by a drift function, say w, whereas the positive part is allowed to be arbitrarily unbounded. Our focus is on the existence of a stationary optimal policy for the discounted CTMDP problems out of the more general class of policies. Both constrained and unconstrained problems are considered. Our investigations are based on the continuous-time version of the Veinott transformation. This technique has not been widely employed in the previous literature on CTMDPs, but it clarifies the roles of the imposed conditions in a rather transparent way.


Author(s):  
Thomas W. Archibald ◽  
Edgar Possani

Abstract This paper analyses the contract between an entrepreneur and an investor, using a non-zero sum game in which the entrepreneur is interested in company survival and the investor in maximizing expected net present value. Theoretical results are given and the model’s usefulness is exemplified using simulations. We have observed that both the entrepreneur and the investor are better off under a contract which involves repayments and a share of the start-up company. We have also observed that the entrepreneur will choose riskier actions as the repayments become harder to meet, up to a level where the company is no longer able to survive.


Author(s):  
Alessandro Ronca ◽  
Giuseppe De Giacomo

Recently, regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice, both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and that it reasonably captures the difficulty of a regular decision process.
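A toy example of such a regular, history-dependent reward (invented for illustration, not taken from the paper): the reward depends on the entire action history, but only through the regular property "the current action differs from the previous one", which a finite transducer with one remembered symbol tracks exactly.

```python
class AlternationRewardTransducer:
    """Finite transducer: state = last action seen; emits reward 1.0
    whenever the current action differs from the previous one."""

    def __init__(self):
        self.prev = None   # transducer state before any action

    def step(self, action):
        reward = 1.0 if (self.prev is not None and action != self.prev) else 0.0
        self.prev = action  # transition of the transducer
        return reward

t = AlternationRewardTransducer()
rewards = [t.step(a) for a in ["a", "b", "b", "a"]]
```

The reward at each step is a function of the whole history, yet the transducer needs only finitely many states, which is exactly the regularity that makes such processes amenable to the PAC-learning analysis in the abstract.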

