A learning algorithm for the finite-time two-armed bandit problem

1984 ◽  
Vol SMC-14 (3) ◽  
pp. 528-534
Author(s):  
M. Sato ◽  
K. Abe ◽  
H. Takeda

2020 ◽  
Vol 34 (04) ◽  
pp. 6518-6525
Author(s):  
Xiao Xu ◽  
Fang Dong ◽  
Yanghua Li ◽  
Shaojian He ◽  
Xin Li

A contextual bandit problem is studied in a highly non-stationary environment, which is ubiquitous in various recommender systems due to the time-varying interests of users. Two models, with disjoint and hybrid payoffs, are considered to characterize the phenomenon that users' preferences towards different items vary differently over time. In the disjoint payoff model, the reward of playing an arm is determined by an arm-specific preference vector, which is piecewise-stationary with asynchronous and distinct changes across arms. An efficient learning algorithm that adapts to abrupt reward changes is proposed, and a theoretical regret analysis shows that the regret scales sublinearly in the time horizon T. The algorithm is further extended to a more general setting with hybrid payoffs, where the reward of playing an arm is determined by both an arm-specific preference vector and a joint coefficient vector shared by all arms. Experiments on real-world datasets verify the advantages of the proposed learning algorithms over baselines in both settings.
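The disjoint payoff model described above is a linear bandit with one coefficient vector per arm, in the spirit of LinUCB. The sketch below shows that base setting with a sliding window standing in for the paper's change-adaptive mechanism; the class name, the window length `tau`, and the confidence width `alpha` are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

class SlidingWindowLinUCB:
    """Disjoint-payoff LinUCB that keeps only the last `tau` observations
    per arm, a simple way to track piecewise-stationary preference vectors.
    (Illustrative sketch; not the paper's change-detection algorithm.)"""

    def __init__(self, n_arms, dim, alpha=1.0, tau=500):
        self.alpha, self.tau, self.dim = alpha, tau, dim
        self.history = [[] for _ in range(n_arms)]  # (context, reward) pairs

    def _stats(self, arm):
        A = np.eye(self.dim)                 # ridge regularizer
        b = np.zeros(self.dim)
        for x, r in self.history[arm][-self.tau:]:
            A += np.outer(x, x)
            b += r * x
        return np.linalg.inv(A), b

    def select(self, contexts):
        """contexts: one feature vector per arm; returns the arm with the
        highest upper confidence bound on its estimated reward."""
        scores = []
        for arm, x in enumerate(contexts):
            A_inv, b = self._stats(arm)
            theta = A_inv @ b                          # per-arm estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, r):
        self.history[arm].append((x, r))
```

Discarding observations older than `tau` rounds is what lets the estimate recover after an abrupt change in an arm's preference vector, at the cost of higher variance during stationary stretches.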


2020 ◽  
Vol 34 (04) ◽  
pp. 5379-5386
Author(s):  
Vishakha Patil ◽  
Ganesh Ghalme ◽  
Vineet Nair ◽  
Y. Narahari

We study a variant of the stochastic multi-armed bandit problem, which we call the Fair-MAB problem, where, in addition to the objective of maximizing the sum of expected rewards, the algorithm must also ensure that, at any time, each arm has been pulled at least a pre-specified fraction of the time. We investigate the interplay between learning and fairness in terms of a pre-specified vector of guaranteed pull fractions. We define a fairness-aware regret, which we call r-Regret, that takes these fairness constraints into account and extends the conventional notion of regret in a natural way. Our primary contribution is a complete characterization of a class of Fair-MAB algorithms via two parameters: the unfairness tolerance and the learning algorithm used as a black box. For this class of algorithms, we provide a fairness guarantee that holds uniformly over time, irrespective of the choice of the learning algorithm. Further, when the learning algorithm is UCB1, we show that our algorithm achieves constant r-Regret for a sufficiently large time horizon. Finally, we analyze the cost of fairness in terms of the conventional notion of regret, and we conclude by experimentally validating our theoretical results.
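One natural reading of the two-parameter characterization is a meta-algorithm: pull any arm that has fallen behind its guaranteed fraction by more than the unfairness tolerance, and otherwise defer to the black-box learner. Below is a minimal sketch along those lines with UCB1 as the black box; the function name, `draw_reward` interface, and exact eligibility test are assumptions, not the paper's pseudocode.

```python
import math
import random

def fair_ucb1(draw_reward, fractions, alpha, horizon):
    """Fairness-first meta-algorithm sketch: if some arm lags its
    guaranteed pull fraction by more than the tolerance `alpha`,
    pull it; otherwise defer to UCB1 as the black-box learner."""
    k = len(fractions)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        # Fairness phase: arms whose pull count lags fractions[i] * t - alpha.
        behind = [i for i in range(k) if counts[i] < fractions[i] * t - alpha]
        if behind:
            arm = random.choice(behind)
        elif 0 in counts:
            arm = counts.index(0)            # initialize each arm once
        else:
            # UCB1 phase: optimism in the face of uncertainty.
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = draw_reward(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts, sums

# Hypothetical usage: three Bernoulli arms, 20% guaranteed pulls each.
means = [0.3, 0.5, 0.7]
counts, _ = fair_ucb1(lambda a: float(random.random() < means[a]),
                      fractions=[0.2, 0.2, 0.2], alpha=1.0, horizon=10000)
```

Because the fairness phase fires whenever any arm's deficit exceeds `alpha`, the pull-fraction guarantee holds uniformly over time no matter what the black-box learner does in the other rounds, which mirrors the structure of the guarantee stated in the abstract.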


1994 ◽  
Vol 05 (02) ◽  
pp. 153-156
Author(s):  
R. MONASSON

A learning algorithm for the two-layered committee machine is proposed, and a proof of its convergence in finite time is given. Its efficiency is compared with that of a simple exhaustive enumeration of the internal representations of the training set.
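For readers unfamiliar with the architecture: a two-layered committee machine takes a majority vote over K hidden perceptrons, and the hidden sign vector is the "internal representation" that the exhaustive baseline enumerates. The abstract does not specify the learning rule, so the sketch below shows only the forward computation, with illustrative sizes.

```python
import numpy as np

def committee_predict(W, x):
    """Two-layered committee machine: K hidden perceptrons vote, and the
    output is the majority sign of their votes. W has shape (K, N),
    one weight vector per hidden unit."""
    hidden = np.sign(W @ x)        # the internal representation of x
    return np.sign(hidden.sum())   # majority vote of the committee

# Illustrative dimensions: K=3 hidden units, N=10 inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 10))
x = rng.standard_normal(10)
print(committee_predict(x=x, W=W))
```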


2000 ◽  
Vol 45 (4) ◽  
pp. 711-714
Author(s):  
S.R. Kulkarni ◽  
G. Lugosi


2020 ◽  
Vol 34 (04) ◽  
pp. 3341-3348
Author(s):  
Junyu Cao ◽  
Wei Sun ◽  
Zuo-Jun (Max) Shen ◽  
Markus Ettl

As recommender systems send a massive amount of content to keep users engaged, users may experience fatigue, which stems from 1) overexposure to irrelevant content and 2) boredom from seeing too many similar recommendations. To address this problem, we consider an online learning setting in which a platform learns a recommendation policy that takes user fatigue into account. We propose an extension of the Dependent Click Model (DCM) to describe users' behavior. We stipulate that, for each piece of content, its attractiveness to a user depends on its intrinsic relevance and a discount factor that measures how much similar content has already been shown. Users view the recommended content sequentially and click on the items they find attractive. Users may leave the platform at any time, and the probability of exiting is higher when they do not like the content. Based on users' feedback, the platform learns the relevance of the underlying content as well as the discounting effect due to content fatigue. We refer to this learning task as the “fatigue-aware DCM Bandit” problem. We consider two learning scenarios, depending on whether the discounting effect is known. For each scenario, we propose a learning algorithm that simultaneously explores and exploits, and we characterize its regret bound.
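A minimal simulation of the user model makes the fatigue-aware DCM concrete: attractiveness decays with the number of similar items already shown, and the exit probability is higher after a skipped item than after a click. The geometric discount gamma**n and all parameter values below are assumed for illustration; the paper only posits a discount factor in the similar-content count.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_session(ranking, relevance, category, gamma=0.7,
                     exit_after_skip=0.3, exit_after_click=0.1):
    """One user session under a fatigue-discounted dependent click model.
    The attractiveness of an item decays as gamma**n, where n is the number
    of same-category items already shown (an assumed functional form).
    Returns the indices the user clicked before leaving."""
    seen = {}                                   # category -> items shown so far
    clicks = []
    for i in ranking:
        n = seen.get(category[i], 0)
        attract = relevance[i] * gamma ** n     # fatigue-discounted appeal
        seen[category[i]] = n + 1
        if rng.random() < attract:
            clicks.append(i)
            if rng.random() < exit_after_click:
                break                           # user leaves after a click
        elif rng.random() < exit_after_skip:
            break                               # disliked content: likelier exit
    return clicks

# Hypothetical session: four items, three of them from the same category.
print(simulate_session(ranking=[0, 1, 2, 3],
                       relevance=[0.9, 0.8, 0.7, 0.6],
                       category=["news", "news", "sports", "news"]))
```

A learner in the abstract's setting would observe only the click/exit feedback from such sessions and estimate `relevance` (and, in the harder scenario, the discount) from them.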


Machines ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. 319
Author(s):  
Yi-Liang Yeh ◽  
Po-Kai Yang

This paper presents innovative reinforcement learning methods for automatically tuning the parameters of a proportional-integral-derivative (PID) controller. Conventionally, the high dimensionality of the Q-table is a primary drawback when implementing a reinforcement learning algorithm. To overcome this obstacle, the idea underlying the n-armed bandit problem is used in this paper. Moreover, gain-scheduled actions are introduced to tune the algorithms and improve the overall system behavior, so that the proposed controllers fulfill multiple performance requirements. An experiment was conducted on a piezo-actuated stage to illustrate the effectiveness of the proposed control designs relative to competing algorithms.
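The bandit view of PID tuning treats each candidate gain triple as an arm and one closed-loop experiment as one pull, which sidesteps the large Q-table of a full state-action formulation. A minimal epsilon-greedy sketch of that idea follows; the reward definition, the `run_experiment` interface, and the epsilon value are assumptions rather than the paper's method, which additionally uses gain-scheduled actions.

```python
import random

def tune_pid_bandit(gain_sets, run_experiment, episodes=200, eps=0.1):
    """Epsilon-greedy n-armed bandit over candidate PID gain triples.
    Each arm is one (Kp, Ki, Kd) set; the reward is whatever
    `run_experiment` returns, e.g. negative integrated tracking error."""
    q = [0.0] * len(gain_sets)          # running value estimate per arm
    n = [0] * len(gain_sets)
    for _ in range(episodes):
        if random.random() < eps:
            arm = random.randrange(len(gain_sets))               # explore
        else:
            arm = max(range(len(gain_sets)), key=q.__getitem__)  # exploit
        reward = run_experiment(*gain_sets[arm])
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]   # incremental mean update
    return gain_sets[max(range(len(gain_sets)), key=q.__getitem__)]

# Hypothetical usage: pick the best of three gain candidates on a plant trial.
# best = tune_pid_bandit([(1.0, 0.1, 0.01), (2.0, 0.2, 0.02), (0.5, 0.05, 0.0)],
#                        run_experiment=my_plant_trial)
```

Gain scheduling, as described in the abstract, would amount to running one such bandit per operating region and switching between the learned gain sets at run time.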


Optimization ◽  
1976 ◽  
Vol 7 (3) ◽  
pp. 471-475
Author(s):  
P.W. Jones
