General Discounting Versus Average Reward

Author(s):  
Marcus Hutter

1981 ◽  
Vol 13 (01) ◽  
pp. 61-83 ◽  
Author(s):  
Richard Serfozo

This is a study of simple random walks, birth and death processes, and M/M/s queues whose transition probabilities and rates are sequentially controlled at the jump times of the process. Each control action yields a one-step reward that depends on the chosen probabilities or transition rates and on the state of the process. The aim is to find control policies that maximize the total discounted or average reward. Conditions are given under which these processes have certain natural monotone optimal policies. Under such a policy for the M/M/s queue, for example, the service and arrival rates are non-decreasing and non-increasing functions, respectively, of the queue length. Properties of these policies and a linear program for computing them are also discussed.
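
To make the monotone-policy idea concrete, here is a minimal sketch (not the paper's linear-programming formulation) that runs discounted value iteration, via uniformization, for a single-server queue in which the controller picks a service rate at each queue length. The arrival rate, the action set SERVICE_RATES, the holding and service costs, the discount rate BETA, and the truncation level N_MAX are all illustrative assumptions.

```python
# Minimal sketch: discounted value iteration, via uniformization, for a controlled
# single-server queue in which the controller picks a service rate at each queue
# length. All parameters and the cost structure below are illustrative assumptions;
# this is not the paper's linear-programming formulation.
import numpy as np

LAMBDA = 0.8                                    # arrival rate (assumed)
SERVICE_RATES = [0.5, 1.0, 1.5]                 # available service rates (assumed)
SERVICE_COST = {0.5: 0.0, 1.0: 0.5, 1.5: 1.2}   # cost per unit time of each rate
HOLD_COST = 0.2                                 # holding cost per customer per unit time
BETA = 0.1                                      # continuous-time discount rate
N_MAX = 30                                      # truncate the state space here

UNIF = LAMBDA + max(SERVICE_RATES)              # uniformization rate
GAMMA = UNIF / (UNIF + BETA)                    # discount factor of the embedded chain

def q_value(V, n, mu):
    """One-step reward plus discounted continuation for queue length n, rate mu."""
    reward = -(HOLD_COST * n + (SERVICE_COST[mu] if n > 0 else 0.0))
    p_up = LAMBDA / UNIF                        # arrival
    p_down = (mu if n > 0 else 0.0) / UNIF      # service completion
    p_stay = 1.0 - p_up - p_down                # fictitious self-transition
    cont = p_up * V[min(n + 1, N_MAX)] + p_down * V[max(n - 1, 0)] + p_stay * V[n]
    return reward / (UNIF + BETA) + GAMMA * cont

V = np.zeros(N_MAX + 1)
for _ in range(10000):                          # value iteration to (near) convergence
    V_new = np.array([max(q_value(V, n, mu) for mu in SERVICE_RATES)
                      for n in range(N_MAX + 1)])
    if np.max(np.abs(V_new - V)) < 1e-9:
        V = V_new
        break
    V = V_new

policy = [max(SERVICE_RATES, key=lambda mu: q_value(V, n, mu)) for n in range(N_MAX + 1)]
print("chosen service rate by queue length:", policy)
# With a cost structure like this one, the printed rates should be non-decreasing
# in the queue length, echoing the monotone-policy structure described above.
```

Controlled arrival rates could be handled the same way by enlarging the action set, though the abstract's linear-programming approach is a different computational route to the same policies.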


2018 ◽  
Author(s):  
Hilary Don ◽  
A Ross Otto ◽  
Astin Cornwall ◽  
Tyler Davis ◽  
Darrell A. Worthy

Learning about the rewards and expected values of choice alternatives is critical for adaptive behavior. Although human choice is affected by how frequently reward-related alternatives are presented, this factor is overlooked by some dominant models of value learning. For instance, the delta rule learns average rewards, whereas the decay rule learns cumulative rewards for each option. In a binary-outcome choice task, participants selected between pairs of options with reward probabilities of .65 (A) versus .35 (B), or .75 (C) versus .25 (D). Crucially, training contained twice as many AB trials as CD trials, so option A accrued the higher cumulative reward while option C gave the higher average reward. Participants then decided between novel combinations of options (e.g., AC). They preferred option A, a result predicted by the decay model but not the delta model. This suggests that expected values are based more on total reward than on average reward.
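
The contrast between the two rules can be illustrated with a short simulation. This is a minimal sketch, not the study's task code or a fitted model: the trial counts, the learning rate ALPHA, the decay parameter DECAY, and the simplifying assumption that both options of a pair are sampled equally often are all stand-ins.

```python
# Minimal sketch of the delta (recency-weighted average) and decay (decayed
# cumulative) learning rules under the frequency manipulation described above.
# Trial counts, ALPHA, and DECAY are illustrative assumptions.
import random

random.seed(0)
P_REWARD = {"A": 0.65, "B": 0.35, "C": 0.75, "D": 0.25}
N_AB, N_CD = 200, 100          # twice as many AB trials as CD trials (assumed counts)
ALPHA, DECAY = 0.1, 0.95       # assumed learning rate and decay parameter

delta_val = {k: 0.0 for k in P_REWARD}   # delta rule: tracks recency-weighted average reward
decay_val = {k: 0.0 for k in P_REWARD}   # decay rule: tracks decayed cumulative reward

def experience(option):
    r = 1.0 if random.random() < P_REWARD[option] else 0.0
    delta_val[option] += ALPHA * (r - delta_val[option])   # delta update
    for k in decay_val:                                    # decay update: all values decay,
        decay_val[k] *= DECAY                              # the chosen option adds its payoff
    decay_val[option] += r

# Simplification: assume both options of a pair are sampled equally often.
trials = ["A", "B"] * (N_AB // 2) + ["C", "D"] * (N_CD // 2)
random.shuffle(trials)
for opt in trials:
    experience(opt)

print("delta-rule values:", {k: round(v, 2) for k, v in delta_val.items()})
print("decay-rule values:", {k: round(v, 2) for k, v in decay_val.items()})
# The delta values tend toward the reward probabilities (so C exceeds A), whereas
# the decayed cumulative totals favor the more frequently encountered option A:
# the pattern that predicts the observed preference for A on novel AC trials.
```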


2021 ◽  
Author(s):  
Sam Hall-McMaster ◽  
Peter Dayan ◽  
Nicolas W. Schuck

Foraging is a common decision problem in natural environments. When new exploitable sites are always available, a simple optimal strategy is to leave the current site when its return falls below a single average reward rate. Here, we examined foraging in a more structured environment, with a limited number of sites that replenished at different rates and had to be revisited. When participants could choose sites, they visited fast-replenishing sites more often, left sites at higher levels of reward, and achieved a higher net reward rate. Decisions to exploit or leave a site were best explained by a computational model that estimated a separate reward rate for each site. This suggests that option-specific information, rather than a single average reward rate, can be used to construct a patch-leaving threshold in some foraging settings.
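
A minimal sketch of the distinction between the two kinds of leaving threshold is given below. It is not the authors' computational model: the replenishment and harvesting dynamics, the fixed round-robin visiting order, the travel cost, and the learning rate ALPHA are all illustrative assumptions. It contrasts leaving when the next expected harvest falls below a single global reward-rate estimate with leaving when it falls below the current site's own estimated rate.

```python
# Minimal sketch: patch leaving with a single global reward-rate threshold versus
# separate per-site reward-rate thresholds. Dynamics and parameters are assumed.
import itertools

REPLENISH = {"fast": 8.0, "medium": 4.0, "slow": 2.0}  # stock gained per step away
MAX_STOCK = 100.0
HARVEST_FRACTION = 0.5     # each harvest takes half of the remaining stock
TRAVEL_STEPS = 1           # steps spent moving to the next site
ALPHA = 0.1                # learning rate for reward-rate estimates

def run(site_specific_threshold, n_steps=2000):
    stock = {s: MAX_STOCK for s in REPLENISH}
    rate = {s: 5.0 for s in REPLENISH}     # per-site reward-rate estimates
    global_rate = 5.0                      # single average reward-rate estimate
    total, leave_levels = 0.0, {s: [] for s in REPLENISH}
    visit_order = itertools.cycle(REPLENISH)   # simplification: fixed visiting order
    current, step = next(visit_order), 0
    while step < n_steps:
        reward = stock[current] * HARVEST_FRACTION
        stock[current] -= reward
        for s in REPLENISH:                # unvisited sites replenish
            if s != current:
                stock[s] = min(MAX_STOCK, stock[s] + REPLENISH[s])
        rate[current] += ALPHA * (reward - rate[current])
        global_rate += ALPHA * (reward - global_rate)
        total += reward
        step += 1
        next_harvest = stock[current] * HARVEST_FRACTION
        threshold = rate[current] if site_specific_threshold else global_rate
        if next_harvest < threshold:       # leave-or-stay decision
            leave_levels[current].append(next_harvest)
            current = next(visit_order)
            step += TRAVEL_STEPS
    means = {s: round(sum(v) / len(v), 1) if v else None for s, v in leave_levels.items()}
    return total / n_steps, means

for label, flag in [("single global threshold", False), ("site-specific thresholds", True)]:
    net_rate, leave_at = run(flag)
    print(f"{label}: net reward per step {net_rate:.1f}; reward level when leaving {leave_at}")
# Under the site-specific rule, fast-replenishing sites tend to be left at higher
# reward levels than slow ones, loosely echoing the behavioral pattern reported above.
```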


Author(s):  
Jerzy Filar ◽  
Koos Vrieze
