continuous reward
Recently Published Documents


TOTAL DOCUMENTS: 17 (FIVE YEARS: 3)
H-INDEX: 5 (FIVE YEARS: 1)

Author(s):  
Xiong Wang ◽  
Riheng Jia

Mean field games facilitate analyzing the multi-armed bandit (MAB) problem for a large number of agents by approximating their interactions with an average effect. Existing mean field models for multi-agent MAB mostly assume a binary reward function, which makes the analysis tractable but is usually not applicable in practical scenarios. In this paper, we study the mean field bandit game with a continuous reward function. Specifically, we focus on establishing the existence and uniqueness of the mean field equilibrium (MFE), thereby guaranteeing the asymptotic stability of the multi-agent system. To accommodate the continuous reward function, we encode the learned reward into an agent state, which is in turn mapped to the agent's stochastic arm-playing policy and updated using realized observations. We show that the state evolution is upper semi-continuous, from which the existence of an MFE follows. Since Markov analysis mainly applies to the discrete-state case, we transform the stochastic continuous-state evolution into a deterministic ordinary differential equation (ODE). On this basis, we characterize a contraction mapping for the ODE that ensures a unique MFE for the bandit game. Extensive evaluations validate our MFE characterization and exhibit tight empirical regret for the MAB problem.
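The abstract describes a loop in which each agent's state encodes its learned reward, the state induces a stochastic arm-playing policy, and realized observations update the state, with the population's average play entering each agent's reward. Below is a minimal sketch of that loop under assumptions: the softmax policy mapping, the congestion-style continuous reward, and all names and parameters are illustrative choices, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, temp=1.0):
    z = (x - x.max()) / temp
    e = np.exp(z)
    return e / e.sum()

class MeanFieldAgent:
    """One agent in a mean-field bandit: its state encodes learned
    (continuous) per-arm rewards and induces a stochastic policy."""
    def __init__(self, n_arms, lr=0.1):
        self.state = np.zeros(n_arms)   # encoded reward estimates
        self.lr = lr

    def policy(self):
        # state -> stochastic arm-playing policy
        return softmax(self.state)

    def step(self, reward_fn, mean_field):
        p = self.policy()
        arm = rng.choice(len(p), p=p)
        # Continuous reward depends on the arm and the population's
        # average play (the mean-field term), not a binary outcome.
        r = reward_fn(arm, mean_field)
        # Stochastic state update from the realized observation.
        self.state[arm] += self.lr * (r - self.state[arm])

# Hypothetical congestion-style continuous reward: an arm pays less
# the more of the population plays it.
def reward_fn(arm, mean_field):
    base = np.array([1.0, 0.8, 0.6])
    return base[arm] - 0.5 * mean_field[arm] + 0.05 * rng.standard_normal()

agents = [MeanFieldAgent(3) for _ in range(500)]
for t in range(200):
    # Mean field = average arm-playing distribution across all agents.
    mean_field = np.mean([a.policy() for a in agents], axis=0)
    for a in agents:
        a.step(reward_fn, mean_field)

print("limiting mean field:", np.round(np.mean([a.policy() for a in agents], axis=0), 3))
```

In this toy setting the average play distribution settles toward a fixed point, which is the empirical analogue of the MFE whose existence and uniqueness the paper establishes.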


2019 ◽  
Vol 51 (01) ◽  
pp. 87-115
Author(s):  
Yi-Shen Lin ◽  
Yi-Ching Yao

In the literature on optimal stopping, the problem of maximizing the expected discounted reward over all stopping times has been explicitly solved for some special reward functions (including (x⁺)^ν, (eˣ − K)⁺, and (K − e⁻ˣ)⁺, for x ∈ ℝ, ν ∈ (0, ∞), and K > 0) under general random walks in discrete time and Lévy processes in continuous time (subject to mild integrability conditions). All such reward functions are continuous, increasing, and logconcave, while the corresponding optimal stopping times are of threshold type (i.e. the solutions are one-sided). In this paper we show that all optimal stopping problems with increasing, logconcave, and right-continuous reward functions admit one-sided solutions for general random walks and Lévy processes, thereby generalizing the aforementioned results. We also investigate in detail the principle of smooth fit for Lévy processes when the reward function is increasing and logconcave.
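Concretely, a one-sided (threshold-type) solution means the optimum is a first-passage time over a fixed level. In the standard notation below, g is the reward function, q > 0 the discount rate, and b* the optimal threshold; the symbols are illustrative and the paper's own notation may differ:

```latex
V(x) \;=\; \sup_{\tau}\, \mathbb{E}_x\!\left[ e^{-q\tau}\, g(X_\tau) \right],
\qquad
\tau^{*} \;=\; \inf\{\, t \ge 0 : X_t \ge b^{*} \,\}
```

Here X is a random walk (t ranging over ℕ) or a Lévy process (t over [0, ∞)). The principle of smooth fit then concerns whether V′(b*) = g′(b*) holds at the threshold.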


2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
Yaofei Ma ◽  
Xiaole Ma ◽  
Xiao Song

As a continuous state space problem, air combat is difficult to solve with traditional dynamic programming (DP) over a discretized state space. This paper studies an approximate dynamic programming (ADP) approach to build a high-performance decision model for air combat in a one-versus-one scenario, in which the iterative process for policy improvement is replaced by mass sampling from historical trajectories and utility-function approximation, ultimately making policy improvement highly efficient. A continuous reward function is also constructed to better guide the plane from any initial situation toward the "winner" state. According to our experiments, the plane is more aggressive when following the policy derived from the ADP approach than when following the baseline Min-Max policy: the "time to win" is greatly reduced, but the cumulative probability of being killed by the enemy is higher. The reasons for this trade-off are analyzed in the paper.
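The abstract does not detail the paper's algorithm, so the following is a minimal fitted-value-iteration sketch on a hypothetical 1-D toy problem. It illustrates the two ingredients the abstract names: regression on mass-sampled states in place of a discretized DP sweep, and a continuous shaped reward that provides a gradient toward the "winner" state. The state, dynamics, and feature map are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D stand-in for the air-combat state: x in [0, 1] measures how
# close the plane is to the "winner" state at x = 1. The paper's actual
# state is the full 1-v-1 combat geometry.
ACTIONS = np.array([-0.05, 0.0, 0.05])

def step(x, a):
    return np.clip(x + a + 0.01 * rng.standard_normal(), 0.0, 1.0)

def reward(x):
    # Continuous reward: a smooth gradient toward the winner state,
    # instead of a sparse win/lose signal at the end.
    return -(1.0 - x) ** 2

def features(x):
    # Simple polynomial features for the utility-function approximator.
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

# ADP via fitted value iteration on mass-sampled states: regress
# V(x) ~ max_a [ r(x) + gamma * V(x') ] rather than sweeping a
# discretized state space as classical DP would.
gamma, w = 0.95, np.zeros(3)
xs = rng.uniform(0.0, 1.0, size=2000)            # sampled "history" states
for it in range(50):
    targets = np.full_like(xs, -np.inf)
    for a in ACTIONS:
        x_next = np.clip(xs + a, 0.0, 1.0)       # deterministic lookahead
        targets = np.maximum(targets, reward(xs) + gamma * features(x_next) @ w)
    w, *_ = np.linalg.lstsq(features(xs), targets, rcond=None)

# Greedy policy derived from the fitted utility function.
def act(x):
    q = [reward(x) + gamma * features(np.array([np.clip(x + a, 0.0, 1.0)])) @ w
         for a in ACTIONS]
    return ACTIONS[int(np.argmax(q))]

x = 0.1
for _ in range(30):
    x = step(x, act(x))
print("final state:", round(float(x), 3))
```

Because the shaped reward is dense, the fitted policy pushes toward x = 1 from any initial state, which is the role the paper's continuous reward plays for the plane seeking the "winner" state.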


1976 ◽  
Vol 28 (4) ◽  
pp. 633-642 ◽  
Author(s):  
R. G. M. Morris ◽  
D. F. Einon ◽  
M. J. Morgan

Four groups of rats were trained to run an alleyway with one trial per day. Two groups were always deprived when trained, while the other two received a partial deprivation schedule. One group of each pair received continuous reward in the goal box, while the other received partial reward. A partial reinforcement effect was found during extinction, and the partially deprived groups also showed persistence in extinction. This result extends the parallels between the effects of satiation and nonreward upon behaviour.


1975 ◽  
Vol 36 (2) ◽  
pp. 659-669
Author(s):  
Jeffrey A. Goldman

The study was designed to test three hypotheses of the small-trials partial-reinforcement effect. Seventy-two hooded rats received either continuous reward or partial reward over 6 acquisition trials in a runway. Prior to acquisition, an equal number of subjects in each schedule condition were given 0, 5, or 10 rewarded goal-box placements. A strong small-trials partial-reinforcement effect was evident. There was an effect of placements on both initial and terminal acquisition performance, indicating that r_g–s_g (the anticipatory goal response) was preconditioned to goal-box cues. However, this preconditioned r_g–s_g failed to influence persistence of responding during extinction. The extinction results are contrary to two frustration analyses but do not contradict an analysis of aftereffects.


1974 ◽  
Vol 34 (3) ◽  
pp. 799-809 ◽  
Author(s):  
Charles L. Goodrick

Rats were trained on a continuous reward (CRF) schedule in a 2-bar test box, with or without a light contingency associated with food reward. During extinction trials with the light contingency presented on CRF, the light contingency was shown to facilitate extinction on the basis of response totals (an energizing effect) and the percentage of responses on the reward bar (a directional effect). For these extinction trials, no evidence was obtained for secondary reinforcing (Sr) properties of the light contingency. However, a second experiment found a strong Sr energizing effect when the light contingency was presented on an FR-5 or FR-10 schedule during extinction trials. These energizing effects were also obtained for rats trained without the light contingency, but to a lesser degree than for rats trained with it. Presentation of the light contingency on an FR schedule during extinction trials also resulted in a strong directional effect for all groups, regardless of training condition. A moderately durable energizing effect of Sr was obtained, which was an increasing function of partial periodic Sr during extinction trials.


1973 ◽  
Vol 37 (2) ◽  
pp. 669-670
Author(s):  
Charles S. Hayes ◽  
Cathie Siders ◽  
Bill Snider

Thirty youngsters from classes for the moderately retarded received either partial or continuous reward across 20 lever-pulling trials. Contrary to predictions based on frustrative nonreward theory, neither the speed of lever pulls nor free-field activity following each trial differentiated the reward groups.


1973 ◽  
Vol 2 (2) ◽  
pp. 103-104 ◽  
Author(s):  
Richard S. Calef ◽  
David C. Hopkins ◽  
Earl R. McHewitt ◽  
Frederick R. Maxwell
