DISCOUNTING AND AVERAGING IN GAMES ACROSS TIME SCALES

2012 · Vol 23 (03) · pp. 609-625
Author(s): Krishnendu Chatterjee, Rupak Majumdar

We introduce two-level discounted and mean-payoff games played by two players on a perfect-information stochastic game graph. The upper-level game is a discounted or mean-payoff game and the lower-level game is an (undiscounted) reachability game. Two-level games model hierarchical and sequential decision making under uncertainty across different time scales. For both discounted and mean-payoff two-level games, we show the existence of pure memoryless optimal strategies for both players and an ordered field property. We show that if there is only one player (Markov decision processes), then the values can be computed in polynomial time. It follows that whether the value of a player equals a given rational constant in two-level discounted or mean-payoff games can be decided in NP ∩ coNP. We also give an alternative strategy-improvement algorithm to compute the value.
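
Since the single-player case (an MDP) is stated to be solvable in polynomial time, the following minimal sketch of discounted value iteration on an ordinary one-level MDP may help fix ideas; the two-state transition matrices, rewards, and discount factor are illustrative assumptions, not the two-level construction from the paper.

```python
import numpy as np

# Illustrative single-player discounted MDP (not the two-level game itself):
# 2 states, 2 actions; P[a][s, s'] are transition probabilities, R[a][s] rewards.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # transitions under action 0
     np.array([[0.5, 0.5], [0.0, 1.0]])]   # transitions under action 1
R = [np.array([1.0, 0.0]),                 # reward for action 0 in each state
     np.array([0.0, 2.0])]                 # reward for action 1 in each state
gamma = 0.9                                # discount factor (assumed)

V = np.zeros(2)
for _ in range(1000):                      # value iteration to a fixed point
    Q = np.array([R[a] + gamma * P[a] @ V for a in range(2)])
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=0)                  # a pure memoryless (stationary) policy
print("values:", V, "policy:", policy)
```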

2013 · Vol 15 (04) · pp. 1340026
Author(s): Prasenjit Mondal, Sagnik Sinha

In this paper, we deal with a subclass of two-person finite SeR-SIT (Separable Reward-State Independent Transition) semi-Markov games that can be solved by solving a single matrix/bimatrix game under the discounted as well as the limiting-average (undiscounted) payoff criterion. A SeR-SIT semi-Markov game does not, in general, satisfy the so-called (Archimedean) ordered field property. Moreover, the ordered field property fails even for a SeR-SIT-PT (Separable Reward-State-Independent Transition Probability and Time) semi-Markov game, the natural semi-Markov analogue of a SeR-SIT stochastic (Markov) game. However, under an additional condition, we show that a subclass of finite SeR-SIT-PT semi-Markov games has the ordered field property, for both the discounted and the undiscounted criteria, with both players having state-independent stationary optimal strategies. The ordered field property also holds for the nonzero-sum case under the same assumptions. We establish a relation between the values of the discounted and the undiscounted zero-sum semi-Markov games for this modified subclass. We propose a pollution tax model for this subclass of SeR-SIT semi-Markov games that is more realistic than the pollution tax model for SeR-SIT stochastic games. Finite-step algorithms are given for the discounted case and for the zero-sum undiscounted case.
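
Because the subclass is reducible to a single matrix/bimatrix game, a hedged sketch of computing the value and an optimal strategy of a zero-sum matrix game by linear programming is given below; the payoff matrix is an arbitrary illustration, not one derived from a SeR-SIT semi-Markov model.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative zero-sum payoff matrix A (row player maximizes); not from a SeR-SIT model.
A = np.array([[3.0, -1.0],
              [0.0,  2.0]])

# Shift payoffs so the value is positive, then solve the standard LP:
#   minimize sum(x) subject to B^T x >= 1, x >= 0;  value(B) = 1 / sum(x).
shift = 1.0 - A.min()
B = A + shift
m, n = B.shape
res = linprog(c=np.ones(m), A_ub=-B.T, b_ub=-np.ones(n), bounds=[(0, None)] * m)
value = 1.0 / res.x.sum() - shift          # value of the original game
strategy = res.x / res.x.sum()             # optimal mixed strategy for the row player
print("game value:", value, "row strategy:", strategy)
```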


2012 · Vol 23 (03) · pp. 687-711
Author(s): Hugo Gimbert, Wiesław Zielonka

We examine perfect-information stochastic mean-payoff games, a class of games containing as special subclasses the usual mean-payoff games and parity games. We show that deterministic memoryless strategies that are optimal for discounted games with state-dependent discount factors close to 1 are also optimal for priority mean-payoff games, establishing a strong link between these two classes.
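
As a rough single-player illustration of the discounted-to-mean-payoff link (with one constant discount factor rather than the state-dependent factors and priority payoffs treated in the paper), the sketch below shows the normalised discounted values approaching the mean-payoff values as the discount factor tends to 1; the three-state graph is an assumption.

```python
import numpy as np

# Toy single-player graph: state 0 chooses between a self-loop of reward 1
# (state 1) and a self-loop of reward 2 (state 2).  As lam -> 1, the normalised
# discounted value (1 - lam) * V_lam approaches the mean-payoff value.
P = [np.array([[0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float),  # action 0
     np.array([[0, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=float)]  # action 1
R = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 2.0])]

def discounted_values(lam, iters=20000):
    V = np.zeros(3)
    for _ in range(iters):                 # plain value iteration
        V = np.max([R[a] + lam * P[a] @ V for a in range(2)], axis=0)
    return V

for lam in (0.9, 0.99, 0.999):
    print(lam, (1 - lam) * discounted_values(lam))  # tends to (2, 1, 2)
```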


Author(s): Ming-Sheng Ying, Yuan Feng, Sheng-Gang Ying

The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of the MDP, namely the quantum MDP (qMDP), which can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
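
As a classical analogue of the finite-horizon dynamic programming developed for qMDPs (quantum states, superoperators, and measurements are not modelled here), the sketch below performs backward induction on an assumed two-state MDP.

```python
import numpy as np

# Finite-horizon backward induction on a classical two-state MDP; all numbers
# below are assumed for illustration.
P = [np.array([[0.8, 0.2], [0.3, 0.7]]),    # transitions under action 0
     np.array([[0.1, 0.9], [0.6, 0.4]])]    # transitions under action 1
R = [np.array([1.0, 0.0]),                  # rewards for action 0 per state
     np.array([0.0, 1.5])]                  # rewards for action 1 per state
H = 5                                       # planning horizon

V = np.zeros(2)                             # terminal values V_H = 0
policy = []                                 # built from the last stage backwards
for _ in range(H):
    Q = np.array([R[a] + P[a] @ V for a in range(2)])
    policy.append(Q.argmax(axis=0))         # optimal (time-dependent) action per state
    V = Q.max(axis=0)
policy.reverse()                            # policy[t][s] = optimal action at stage t
print("optimal value at stage 0:", V)
print("stage-wise policy:", policy)
```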


2012 · Vol 25 (20) · pp. 6975-6988
Author(s): Jung-Eun Chu, Saji N. Hameed, Kyung-Ja Ha

The hypothesis that regional characteristics of the East Asian summer monsoon (EASM) result from the presence of nonlinear coupled features that modulate the seasonal circulation and rainfall at the intraseasonal time scale is advanced in this study. To examine this hypothesis, the authors analyze daily EASM variability using a nonlinear multivariate data-classification algorithm known as the self-organizing map (SOM). On the basis of various SOM node analyses, four major intraseasonal phases of the EASM are identified. The first node describes a circulation state corresponding to weak tropical and subtropical pressure systems, strong upper-level jets, weakened monsoonal winds, and cyclonic upper-level vorticity. This mode, related to large rainfall anomalies in southeast China and southern Japan, is identified as the mei-yu–baiu phase. The second node represents a distinct circulation state corresponding to a strengthened subtropical high, monsoonal winds, and anticyclonic upper-level vorticity in southeast Korea, which is identified as the changma phase. The third node is related to copious rain over Korea following changma, which the authors name the postchangma phase. The fourth node is situated diagonally opposite the changma mode; because Korea experiences a dry spell associated with this SOM node, it is referred to as the dry-spell phase. The authors also demonstrate that a strong modulation of the changma and dry-spell phases on interannual time scales occurs during El Niño and La Niña years. The results imply that the key to predictability of the EASM on interannual time scales may lie in the analysis and exploitation of its nonlinear characteristics.
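
For readers unfamiliar with SOMs, the sketch below trains a tiny self-organizing map on synthetic two-dimensional data to illustrate the node-based classification step; the real analysis uses daily multivariate monsoon fields and a different node grid, neither of which is reproduced here.

```python
import numpy as np

# Minimal self-organizing map (SOM) on synthetic 2-D data; the toy data and the
# 2x2 node grid are assumptions made purely to show the training mechanics.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))

grid = np.array([(i, j) for i in range(2) for j in range(2)], dtype=float)  # 2x2 nodes
weights = rng.normal(size=(4, 2))

for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)                          # decaying learning rate
    sigma = 1.0 * (1 - epoch / 50) + 0.1                 # decaying neighbourhood width
    for x in data:
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))     # best-matching unit
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)            # grid distance to the BMU
        h = np.exp(-d2 / (2 * sigma ** 2))                    # neighbourhood function
        weights += lr * h[:, None] * (x - weights)            # pull nodes toward x

nodes = np.array([np.argmin(((weights - x) ** 2).sum(axis=1)) for x in data])
print("samples per SOM node:", np.bincount(nodes, minlength=4))
```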


2018 · Vol 15 (02) · pp. 1850011
Author(s): Frano Petric, Damjan Miklić, Zdenko Kovačić

The existing procedures for autism spectrum disorder (ASD) diagnosis are often time-consuming and tiresome both for highly trained human evaluators and for children, which may be alleviated by using humanoid robots in the diagnostic process. Hence, this paper proposes a framework for robot-assisted ASD evaluation based on partially observable Markov decision process (POMDP) modeling, specifically POMDPs with mixed observability (MOMDPs). POMDPs are broadly used for modeling optimal sequential decision-making tasks under uncertainty. Spurred by the widely accepted Autism Diagnostic Observation Schedule (ADOS), we emulate ADOS through four tasks whose models incorporate observations of multiple social cues such as eye contact, gestures, and utterances. Relying only on those observations, the robot assesses the child’s ASD-relevant functioning level (which is partially observable) within a particular task and provides human evaluators with readable information by partitioning its belief space. Finally, we evaluate the proposed MOMDP task models and demonstrate that chaining the tasks provides fine-grained outcome quantification, which could also increase the appeal of robot-assisted diagnostic protocols in the future.
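
The belief-space reasoning in such a framework rests on the standard POMDP belief update; a minimal sketch with two hypothetical "functioning level" states, a single action, and assumed probabilities is shown below (the actual ADOS-based task models are not reproduced).

```python
import numpy as np

# Bayesian belief update for a generic POMDP, the core operation behind
# belief-space partitioning; the two hidden states, the single implicit action,
# and all probabilities below are illustrative assumptions.
T = np.array([[0.9, 0.1],        # T[s, s']: transition probabilities for the action
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],        # O[s', o]: observation likelihood in the new state
              [0.1, 0.9]])

def belief_update(b, obs):
    """Return the posterior belief after acting and observing `obs`."""
    predicted = b @ T                      # predict the next hidden state
    posterior = predicted * O[:, obs]      # weight by the observation likelihood
    return posterior / posterior.sum()     # normalise

b = np.array([0.5, 0.5])                   # uniform prior over the two hidden levels
for obs in [1, 1, 0, 1]:                   # a hypothetical observation sequence
    b = belief_update(b, obs)
    print("belief:", b)
```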


2021 · pp. 1-16
Author(s): Pegah Alizadeh, Emiliano Traversi, Aomar Osmani

Markov decision processes (MDPs) are a powerful tool for planning tasks and sequential decision-making problems. In this work we deal with MDPs with imprecise rewards, which are often used in situations where the data are uncertain. In this context, we provide algorithms for finding the policy that minimizes the maximum regret. To the best of our knowledge, all the regret-based methods proposed in the literature focus on providing an optimal stochastic policy. We introduce for the first time a method to compute an optimal deterministic policy using optimization approaches. Deterministic policies are easily interpretable for users because, for a given state, they prescribe a unique choice. To better motivate the use of an exact procedure for finding a deterministic policy, we show some (theoretical and experimental) cases where the intuitive idea of using the deterministic policy obtained by “determinizing” the optimal stochastic policy leads to a policy far from the exact optimal deterministic policy.
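
To make the regret objective concrete, the sketch below evaluates the maximum regret of one fixed deterministic policy over a small finite set of candidate (state-based) reward vectors; the paper's algorithms handle general imprecise-reward sets and search for the minimax-regret policy exactly, which this illustration does not attempt.

```python
import numpy as np

# Max-regret evaluation of a fixed deterministic policy against a finite set of
# candidate state-reward vectors.  The MDP, the candidates, and the policy are
# all illustrative assumptions.
gamma = 0.95
P = [np.array([[0.9, 0.1], [0.4, 0.6]]),    # transitions under action 0
     np.array([[0.2, 0.8], [0.7, 0.3]])]    # transitions under action 1

def policy_value(policy, r):
    """Discounted value of a deterministic policy for state-reward vector r."""
    Pp = np.array([P[policy[s]][s] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * Pp, r)

def optimal_value(r, iters=2000):
    V = np.zeros(2)
    for _ in range(iters):                  # value iteration for this reward vector
        V = np.max([r + gamma * P[a] @ V for a in range(2)], axis=0)
    return V

candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
policy = [0, 1]                             # a fixed deterministic policy to assess
max_regret = max(np.max(optimal_value(r) - policy_value(policy, r)) for r in candidates)
print("max regret of", policy, "=", max_regret)
```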


Author(s): Carlos Diuk, Michael Littman

Reinforcement learning (RL) deals with the problem of an agent that has to learn how to behave in order to maximize its utility through its interactions with an environment (Sutton & Barto, 1998; Kaelbling, Littman & Moore, 1996). Reinforcement learning problems are usually formalized as Markov decision processes (MDPs), which consist of a finite set of states and a finite set of actions that the agent can perform. At any given point in time, the agent is in a certain state and picks an action. It then observes the new state this action leads to and receives a reward signal. The goal of the agent is to maximize its long-term reward. In this standard formalization, no particular structure or relationship between states is assumed. However, learning in environments with extremely large state spaces is infeasible without some form of generalization. Exploiting the underlying structure of a problem can enable such generalization and has long been recognized as an important aspect of representing sequential decision tasks (Boutilier et al., 1999). Hierarchical reinforcement learning is the subfield of RL that deals with the discovery and/or exploitation of this underlying structure. Two main ideas come into play in hierarchical RL. The first is to break a task into a hierarchy of smaller subtasks, each of which can be learned faster and more easily than the whole problem; subtasks can also be performed multiple times in the course of achieving the larger task, reusing accumulated knowledge and skills. The second is to use state abstraction within subtasks: not every subtask needs to be concerned with every aspect of the state space, so some states can be abstracted away and treated as identical for the purposes of the given subtask.
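
A minimal, assumed example of the first idea (subtask decomposition): a ten-cell corridor where the top level learns with Q-learning over two option-like subtasks, "reach the door" and "reach the goal", instead of over primitive moves. The environment, the options, and the simplified option-level update (discounting per decision rather than per primitive step) are illustrative assumptions, not a specific algorithm from the literature.

```python
import random

# Corridor of 10 cells; the task is decomposed into two subtasks (options):
# "reach the door" (cell 5) and "reach the goal" (cell 9).  The top level runs
# Q-learning over these macro-actions only.  Everything here is illustrative.
N, DOOR, GOAL = 10, 5, 9

def run_option(state, target):
    """Subtask policy: walk one cell at a time toward `target`."""
    cost = 0
    while state != target:
        state += 1 if target > state else -1
        cost += 1
    return state, cost

options = [DOOR, GOAL]                       # each option is "reach this subgoal"
Q = {(s, o): 0.0 for s in range(N) for o in range(len(options))}
alpha, gamma, eps = 0.5, 0.95, 0.2

for _ in range(200):                         # episodes
    state = 0
    while state != GOAL:
        if random.random() < eps:            # epsilon-greedy over options
            o = random.randrange(len(options))
        else:
            o = max(range(len(options)), key=lambda i: Q[(state, i)])
        nxt, cost = run_option(state, options[o])
        reward = -(1 + cost) + (10.0 if nxt == GOAL else 0.0)
        best_next = 0.0 if nxt == GOAL else max(Q[(nxt, i)] for i in range(len(options)))
        Q[(state, o)] += alpha * (reward + gamma * best_next - Q[(state, o)])
        state = nxt

print("learned option values at the start cell:",
      {"go-to-door": round(Q[(0, 0)], 2), "go-to-goal": round(Q[(0, 1)], 2)})
```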

