Average reward criterion
Recently Published Documents

TOTAL DOCUMENTS: 38 (last five years: 3)
H-INDEX: 8 (last five years: 0)

2021 · Vol 11 (3) · pp. 1098 · Author(s): Norbert Kozłowski, Olgierd Unold

Initially, Anticipatory Classifier Systems (ACS) were designed to address both single-step and multistep decision problems. In the latter case, the objective was to maximize the total discounted reward, usually with Q-learning-based algorithms. Studies on other Learning Classifier Systems (LCS) revealed many real-world sequential decision problems where the preferred objective is the maximization of the average of successive rewards. This paper proposes a corresponding modification of the learning component that allows such problems to be addressed. The modified system is called AACS2 (Averaged ACS2) and is tested on three multistep benchmark problems.
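
For orientation, the shift from a discounted to an averaged objective can be seen in a tabular sketch: the first update below is plain Q-learning, the second is an R-learning-style rule in which a running estimate of the average reward per step replaces discounting. This is only an illustration of the criterion, not the actual AACS2 learning component; the action set, step sizes, and function names are assumptions.

```python
from collections import defaultdict

ACTIONS = [0, 1]            # illustrative action set
Q = defaultdict(float)      # tabular action values, keyed by (state, action)

def discounted_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Standard Q-learning step: optimizes the total discounted reward."""
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def averaged_q_update(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """R-learning-style step: optimizes the average of successive rewards.

    Instead of discounting, each immediate reward is compared against rho,
    a running estimate of the average reward per step.
    """
    greedy = Q[(s, a)] == max(Q[(s, b)] for b in ACTIONS)
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
    if greedy:
        # Refine the average-reward estimate only after greedy actions.
        rho += beta * (r + best_next - max(Q[(s, b)] for b in ACTIONS) - rho)
    return rho
```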


2020 · Vol 22 (02) · pp. 2040002 · Author(s): Reinoud Joosten, Llea Samuel

Games with endogenous transition probabilities and endogenous stage payoffs (or ETP–ESP games for short) are stochastic games in which both the transition probabilities and the payoffs at any stage are continuous functions of the relative frequencies of all past action combinations chosen. We present methods to compute large sets of jointly-convergent pure-strategy rewards in two-player ETP–ESP games with communicating states under the limiting average reward criterion. Such sets are useful in determining feasible rewards in a game, and instrumental in obtaining the set of (Nash) equilibrium rewards.
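
For reference, the limiting average reward criterion under which these feasible and equilibrium rewards are computed can be written as follows; the notation is generic rather than the paper's own.

```latex
% Limiting average reward of a strategy pair (\pi^1, \pi^2) of the two
% players, starting from state s (generic notation):
\[
  \gamma^i(s, \pi^1, \pi^2)
    \;=\; \liminf_{T \to \infty} \frac{1}{T}\,
      \mathbb{E}_{s,\pi^1,\pi^2}\!\left[ \sum_{t=1}^{T} r^i_t \right],
  \qquad i = 1, 2,
\]
% where r^i_t is player i's stage payoff at time t; in an ETP-ESP game both
% these payoffs and the transition law depend on past action frequencies.
```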


2020 · Vol 34 (10) · pp. 13777-13778 · Author(s): Akshay Dharmavaram, Matthew Riemer, Shalabh Bhatnagar

Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long-term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem to the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady state of the Markov chain defined by the agent's policy. Furthermore, we use an approach based on ordinary differential equations for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values with probability one. Finally, we illustrate the competitive advantage of learning options in the average reward setting on a grid-world environment with sparse rewards.
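
A minimal sketch of the underlying average-reward (differential) actor-critic update may help fix ideas: the discount factor disappears and a running estimate of the steady-state reward is subtracted from each immediate reward. This is a flat, tabular illustration of the criterion only, not the hierarchical option-critic algorithm of the paper; the state and action counts and step sizes are assumptions.

```python
import numpy as np

n_states, n_actions = 5, 2
V = np.zeros(n_states)                     # critic: differential value function
theta = np.zeros((n_states, n_actions))    # actor: softmax policy preferences
rho = 0.0                                  # running estimate of average reward

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def update(s, a, r, s_next, alpha_v=0.1, alpha_pi=0.05, alpha_rho=0.01):
    """One differential actor-critic step: rho replaces discounting."""
    global rho
    delta = r - rho + V[s_next] - V[s]     # differential TD error
    rho += alpha_rho * delta               # track the long-run average reward
    V[s] += alpha_v * delta                # critic update
    grad_log = -policy(s)
    grad_log[a] += 1.0                     # grad of log softmax w.r.t. theta[s]
    theta[s] += alpha_pi * delta * grad_log  # actor ascends the average reward
    return delta
```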


2017 · Vol 62 (11) · pp. 6032-6038 · Author(s): Xiaofeng Jiang, Xiaodong Wang, Hongsheng Xi, Falin Liu

2015 · Vol 52 (2) · pp. 419-440 · Author(s): Rolando Cavazos-Cadena, Raúl Montes-De-Oca, Karel Sladký

This paper concerns discrete-time Markov decision chains with denumerable state and compact action sets. Besides standard continuity requirements, the main assumption on the model is that it admits a Lyapunov function ℓ. In this context the average reward criterion is analyzed from the sample-path point of view. The main conclusion is that if the expected average reward associated with ℓ² is finite under any policy, then a stationary policy obtained from the optimality equation in the standard way is sample-path average optimal in a strong sense.
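
In generic notation (not the paper's), the expected and sample-path versions of the criterion differ in where the limit is taken; sample-path optimality asks the stationary policy to attain the optimal average along almost every trajectory rather than only in expectation.

```latex
% Expected vs. sample-path average reward of a policy \pi from state x
% (generic notation):
\[
  J(\pi, x) \;=\; \liminf_{n \to \infty} \frac{1}{n}\,
      \mathbb{E}^{\pi}_{x}\!\left[ \sum_{t=0}^{n-1} r(X_t, A_t) \right],
  \qquad
  J_{\mathrm{sp}}(\pi, x) \;=\; \liminf_{n \to \infty} \frac{1}{n}
      \sum_{t=0}^{n-1} r(X_t, A_t) \quad \text{(pathwise, a.s.)}.
\]
```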


2015 · Vol 17 (02) · pp. 1540014 · Author(s): Reinoud Joosten

We model and analyze strategic interaction over time in a duopoly. Each period the firms independently and simultaneously take two sequential decisions: first, they decide whether or not to advertise; then they set prices for goods that are imperfect substitutes. Each firm's current "sales potential" is affected not only by its own but also by the other firm's past advertising efforts. How much of this potential materializes as immediate sales depends on the current advertising decisions. If both firms advertise, the "sales potential" turns into demand; otherwise part of it "evaporates" and does not materialize. We determine feasible rewards and equilibria for the limiting average reward criterion. Uniqueness of equilibrium is by no means guaranteed, but Pareto efficiency may serve very well as a refinement criterion for wide ranges of the advertising costs.

