A learning algorithm for the finite-time two-armed bandit problem

1984 ◽  
Vol SMC-14 (3) ◽  
pp. 528-534
Author(s):  
M. Sato ◽  
K. Abe ◽  
H. Takeda

2020 ◽  
Vol 34 (04) ◽  
pp. 6518-6525
Author(s):  
Xiao Xu ◽  
Fang Dong ◽  
Yanghua Li ◽  
Shaojian He ◽  
Xin Li

A contextual bandit problem is studied in a highly non-stationary environment, which is ubiquitous in various recommender systems due to the time-varying interests of users. Two models, with disjoint and hybrid payoffs, are considered to characterize the phenomenon that users' preferences towards different items vary differently over time. In the disjoint payoff model, the reward of playing an arm is determined by an arm-specific preference vector, which is piecewise-stationary with asynchronous and distinct changes across arms. An efficient learning algorithm that adapts to abrupt reward changes is proposed, and a theoretical regret analysis shows that the regret scales sublinearly in the time horizon T. The algorithm is further extended to a more general setting with hybrid payoffs, where the reward of playing an arm is determined by both an arm-specific preference vector and a joint coefficient vector shared by all arms. Experiments on real-world datasets verify the advantages of the proposed learning algorithms over baselines in both settings.
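The disjoint payoff model described above is a linear bandit with one coefficient vector per arm, in the spirit of LinUCB. The sketch below shows that base setting with a sliding window standing in for the paper's change-adaptive mechanism; the class name, the window length `tau`, and the confidence width `alpha` are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

class SlidingWindowLinUCB:
    """Disjoint-payoff LinUCB that keeps only the last `tau` observations
    per arm, a simple way to track piecewise-stationary preference vectors.
    (Illustrative sketch; not the paper's change-detection algorithm.)"""

    def __init__(self, n_arms, dim, alpha=1.0, tau=500):
        self.alpha, self.tau, self.dim = alpha, tau, dim
        self.history = [[] for _ in range(n_arms)]  # (context, reward) pairs

    def _stats(self, arm):
        A = np.eye(self.dim)                 # ridge regularizer
        b = np.zeros(self.dim)
        for x, r in self.history[arm][-self.tau:]:
            A += np.outer(x, x)
            b += r * x
        return np.linalg.inv(A), b

    def select(self, contexts):
        """contexts: one feature vector per arm; returns the arm with the
        highest upper confidence bound on its estimated reward."""
        scores = []
        for arm, x in enumerate(contexts):
            A_inv, b = self._stats(arm)
            theta = A_inv @ b                          # per-arm estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, r):
        self.history[arm].append((x, r))
```

Discarding observations older than `tau` rounds is what lets the estimate recover after an abrupt change in an arm's preference vector, at the cost of higher variance during stationary stretches.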


2020 ◽  
Vol 34 (04) ◽  
pp. 5379-5386
Author(s):  
Vishakha Patil ◽  
Ganesh Ghalme ◽  
Vineet Nair ◽  
Y. Narahari

We study a variant of the stochastic multi-armed bandit problem, which we call the Fair-MAB problem, where, in addition to the objective of maximizing the sum of expected rewards, the algorithm must also ensure that, at any time, each arm has been pulled at least a pre-specified fraction of the time. We investigate the interplay between learning and fairness in terms of a pre-specified vector of guaranteed pull fractions. We define a fairness-aware regret, which we call r-Regret, that takes these fairness constraints into account and extends the conventional notion of regret in a natural way. Our primary contribution is a complete characterization of a class of Fair-MAB algorithms via two parameters: the unfairness tolerance and the learning algorithm used as a black box. For this class of algorithms, we provide a fairness guarantee that holds uniformly over time, irrespective of the choice of the learning algorithm. Further, when the learning algorithm is UCB1, we show that our algorithm achieves constant r-Regret for a sufficiently large time horizon. Finally, we analyze the cost of fairness in terms of the conventional notion of regret, and we conclude by experimentally validating our theoretical results.
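One natural reading of the two-parameter characterization is a meta-algorithm: pull any arm that has fallen behind its guaranteed fraction by more than the unfairness tolerance, and otherwise defer to the black-box learner. Below is a minimal sketch along those lines with UCB1 as the black box; the function name, `draw_reward` interface, and exact eligibility test are assumptions, not the paper's pseudocode.

```python
import math
import random

def fair_ucb1(draw_reward, fractions, alpha, horizon):
    """Fairness-first meta-algorithm sketch: if some arm lags its
    guaranteed pull fraction by more than the tolerance `alpha`,
    pull it; otherwise defer to UCB1 as the black-box learner."""
    k = len(fractions)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        # Fairness phase: arms whose pull count lags fractions[i] * t - alpha.
        behind = [i for i in range(k) if counts[i] < fractions[i] * t - alpha]
        if behind:
            arm = random.choice(behind)
        elif 0 in counts:
            arm = counts.index(0)            # initialize each arm once
        else:
            # UCB1 phase: optimism in the face of uncertainty.
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = draw_reward(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts, sums

# Hypothetical usage: three Bernoulli arms, 20% guaranteed pulls each.
means = [0.3, 0.5, 0.7]
counts, _ = fair_ucb1(lambda a: float(random.random() < means[a]),
                      fractions=[0.2, 0.2, 0.2], alpha=1.0, horizon=10000)
```

Because the fairness phase fires whenever any arm's deficit exceeds `alpha`, the pull-fraction guarantee holds uniformly over time no matter what the black-box learner does in the other rounds, which mirrors the structure of the guarantee stated in the abstract.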


1994 ◽  
Vol 05 (02) ◽  
pp. 153-156
Author(s):  
R. MONASSON

A learning algorithm for the two-layered committee machine is proposed, and a proof of its convergence in finite time is given. Its efficiency is compared with that of a simple exhaustive enumeration of the internal representations of the training set.
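For readers unfamiliar with the architecture: a two-layered committee machine takes a majority vote over K hidden perceptrons, and the hidden sign vector is the "internal representation" that the exhaustive baseline enumerates. The abstract does not specify the learning rule, so the sketch below shows only the forward computation, with illustrative sizes.

```python
import numpy as np

def committee_predict(W, x):
    """Two-layered committee machine: K hidden perceptrons vote, and the
    output is the majority sign of their votes. W has shape (K, N),
    one weight vector per hidden unit."""
    hidden = np.sign(W @ x)        # the internal representation of x
    return np.sign(hidden.sum())   # majority vote of the committee

# Illustrative dimensions: K=3 hidden units, N=10 inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 10))
x = rng.standard_normal(10)
print(committee_predict(x=x, W=W))
```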


2000 ◽  
Vol 45 (4) ◽  
pp. 711-714
Author(s):  
S.R. Kulkarni ◽  
G. Lugosi


2020 ◽  
Vol 34 (04) ◽  
pp. 3341-3348
Author(s):  
Junyu Cao ◽  
Wei Sun ◽  
Zuo-Jun (Max) Shen ◽  
Markus Ettl

As recommender systems send a massive amount of content to keep users engaged, users may experience fatigue, which stems from 1) overexposure to irrelevant content and 2) boredom from seeing too many similar recommendations. To address this problem, we consider an online learning setting in which a platform learns a recommendation policy that takes user fatigue into account. We propose an extension of the Dependent Click Model (DCM) to describe users' behavior. We stipulate that, for each piece of content, its attractiveness to a user depends on its intrinsic relevance and a discount factor that measures how much similar content has already been shown. Users view the recommended content sequentially and click on the items they find attractive. Users may leave the platform at any time, and the probability of exiting is higher when they do not like the content. Based on users' feedback, the platform learns the relevance of the underlying content as well as the discounting effect due to content fatigue. We refer to this learning task as the “fatigue-aware DCM Bandit” problem. We consider two learning scenarios, depending on whether the discounting effect is known. For each scenario, we propose a learning algorithm that simultaneously explores and exploits, and we characterize its regret bound.
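A minimal simulation of the user model makes the fatigue-aware DCM concrete: attractiveness decays with the number of similar items already shown, and the exit probability is higher after a skipped item than after a click. The geometric discount gamma**n and all parameter values below are assumed for illustration; the paper only posits a discount factor in the similar-content count.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_session(ranking, relevance, category, gamma=0.7,
                     exit_after_skip=0.3, exit_after_click=0.1):
    """One user session under a fatigue-discounted dependent click model.
    The attractiveness of an item decays as gamma**n, where n is the number
    of same-category items already shown (an assumed functional form).
    Returns the indices the user clicked before leaving."""
    seen = {}                                   # category -> items shown so far
    clicks = []
    for i in ranking:
        n = seen.get(category[i], 0)
        attract = relevance[i] * gamma ** n     # fatigue-discounted appeal
        seen[category[i]] = n + 1
        if rng.random() < attract:
            clicks.append(i)
            if rng.random() < exit_after_click:
                break                           # user leaves after a click
        elif rng.random() < exit_after_skip:
            break                               # disliked content: likelier exit
    return clicks

# Hypothetical session: four items, three of them from the same category.
print(simulate_session(ranking=[0, 1, 2, 3],
                       relevance=[0.9, 0.8, 0.7, 0.6],
                       category=["news", "news", "sports", "news"]))
```

A learner in the abstract's setting would observe only the click/exit feedback from such sessions and estimate `relevance` (and, in the harder scenario, the discount) from them.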


Machines ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. 319
Author(s):  
Yi-Liang Yeh ◽  
Po-Kai Yang

This paper presents innovative reinforcement learning methods for automatically tuning the parameters of a proportional-integral-derivative (PID) controller. Conventionally, the high dimensionality of the Q-table is a primary drawback when implementing a reinforcement learning algorithm. To overcome this obstacle, the idea underlying the n-armed bandit problem is used in this paper. Moreover, gain-scheduled actions are introduced to tune the algorithms and improve the overall system behavior, so that the proposed controllers fulfill multiple performance requirements. An experiment was conducted on a piezo-actuated stage to illustrate the effectiveness of the proposed control designs relative to competing algorithms.
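The bandit view of PID tuning treats each candidate gain triple as an arm and one closed-loop experiment as one pull, which sidesteps the large Q-table of a full state-action formulation. A minimal epsilon-greedy sketch of that idea follows; the reward definition, the `run_experiment` interface, and the epsilon value are assumptions rather than the paper's method, which additionally uses gain-scheduled actions.

```python
import random

def tune_pid_bandit(gain_sets, run_experiment, episodes=200, eps=0.1):
    """Epsilon-greedy n-armed bandit over candidate PID gain triples.
    Each arm is one (Kp, Ki, Kd) set; the reward is whatever
    `run_experiment` returns, e.g. negative integrated tracking error."""
    q = [0.0] * len(gain_sets)          # running value estimate per arm
    n = [0] * len(gain_sets)
    for _ in range(episodes):
        if random.random() < eps:
            arm = random.randrange(len(gain_sets))               # explore
        else:
            arm = max(range(len(gain_sets)), key=q.__getitem__)  # exploit
        reward = run_experiment(*gain_sets[arm])
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]   # incremental mean update
    return gain_sets[max(range(len(gain_sets)), key=q.__getitem__)]

# Hypothetical usage: pick the best of three gain candidates on a plant trial.
# best = tune_pid_bandit([(1.0, 0.1, 0.01), (2.0, 0.2, 0.02), (0.5, 0.05, 0.0)],
#                        run_experiment=my_plant_trial)
```

Gain scheduling, as described in the abstract, would amount to running one such bandit per operating region and switching between the learned gain sets at run time.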


Optimization ◽  
1976 ◽  
Vol 7 (3) ◽  
pp. 471-475
Author(s):  
P.W. Jones
