Globally Informative Thompson Sampling for Structured Bandit Problems with Application to CrowdTranscoding

DUELING BANDIT PROBLEMS

Probability in the Engineering and Informational Sciences ◽

10.1017/s0269964820000601 ◽

2020 ◽

pp. 1-12

Author(s):

Erol Peköz ◽

Sheldon M. Ross ◽

Zhengyu Zhang

Keyword(s):

Decision Maker ◽

Time Frame ◽

Bandit Problems ◽

Infinite Time ◽

Expected Number ◽

Thompson Sampling ◽

Type Algorithm

There is a set of n bandits and at every stage, two of the bandits are chosen to play a game, with the result of a game being learned. In the “weak regret problem,” we suppose there is a “best” bandit that wins each game it plays with probability at least p > 1/2, with the value of p being unknown. The objective is to choose bandits to maximize the number of times that one of the competitors is the best bandit. In the “strong regret problem”, we suppose that bandit i has unknown value v i , i = 1, …, n, and that i beats j with probability v i /(v i + v j ). One version of strong regret is interested in maximizing the number of times that the contest is between the players with the two largest values. Another version supposes that at any stage, rather than choosing two arms to play a game, the decision maker can declare that a particular arm is the best, with the objective of maximizing the number of stages in which the arm with the largest value is declared to be the best. In the weak regret problem, we propose a policy and obtain an analytic bound on the expected number of stages over an infinite time frame that the best arm is not one of the competitors when this policy is employed. In the strong regret problem, we propose a Thompson sampling type algorithm and empirically compare its performance with others in the literature.

Download Full-text

Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm

Entropy ◽

10.3390/e23030380 ◽

2021 ◽

Vol 23 (3) ◽

pp. 380

Author(s):

Emanuele Cavenaghi ◽

Gabriele Sottocornola ◽

Fabio Stella ◽

Markus Zanker

Keyword(s):

Real World ◽

Concept Drift ◽

Empirical Evaluation ◽

Sliding Window ◽

Discount Factor ◽

Data Streaming ◽

Sources Of Information ◽

Sequential Decision ◽

Time Step ◽

Thompson Sampling

The Multi-Armed Bandit (MAB) problem has been extensively studied in order to address real-world challenges related to sequential decision making. In this setting, an agent selects the best action to be performed at time-step t, based on the past rewards received by the environment. This formulation implicitly assumes that the expected payoff for each action is kept stationary by the environment through time. Nevertheless, in many real-world applications this assumption does not hold and the agent has to face a non-stationary environment, that is, with a changing reward distribution. Thus, we present a new MAB algorithm, named f-Discounted-Sliding-Window Thompson Sampling (f-dsw TS), for non-stationary environments, that is, when the data streaming is affected by concept drift. The f-dsw TS algorithm is based on Thompson Sampling (TS) and exploits a discount factor on the reward history and an arm-related sliding window to contrast concept drift in non-stationary environments. We investigate how to combine these two sources of information, namely the discount factor and the sliding window, by means of an aggregation function f(.). In particular, we proposed a pessimistic (f=min), an optimistic (f=max), as well as an averaged (f=mean) version of the f-dsw TS algorithm. A rich set of numerical experiments is performed to evaluate the f-dsw TS algorithm compared to both stationary and non-stationary state-of-the-art TS baselines. We exploited synthetic environments (both randomly-generated and controlled) to test the MAB algorithms under different types of drift, that is, sudden/abrupt, incremental, gradual and increasing/decreasing drift. Furthermore, we adapt four real-world active learning tasks to our framework—a prediction task on crimes in the city of Baltimore, a classification task on insects species, a recommendation task on local web-news, and a time-series analysis on microbial organisms in the tropical air ecosystem. The f-dsw TS approach emerges as the best performing MAB algorithm. At least one of the versions of f-dsw TS performs better than the baselines in synthetic environments, proving the robustness of f-dsw TS under different concept drift types. Moreover, the pessimistic version (f=min) results as the most effective in all real-world tasks.

Download Full-text

Safe Linear Thompson Sampling with Side Information

IEEE Transactions on Signal Processing ◽

10.1109/tsp.2021.3089822 ◽

2021 ◽

pp. 1-1

Author(s):

Ahmadreza Moradipari ◽

Sanae Amani ◽

Mahnoosh Alizadeh ◽

Christos Thrampoulidis

Keyword(s):

Side Information ◽

Thompson Sampling

Download Full-text

TSOR: Thompson Sampling-based Opportunistic Routing

IEEE Transactions on Wireless Communications ◽

10.1109/twc.2021.3082080 ◽

2021 ◽

pp. 1-1

Author(s):

Zhiming Huang ◽

Yifan Xu ◽

Jianping Pan

Keyword(s):

Opportunistic Routing ◽

Thompson Sampling

Download Full-text

A Decentralized Communication Policy for Multi Agent Multi Armed Bandit Problems

2020 European Control Conference (ECC) ◽

10.23919/ecc51009.2020.9143811 ◽

2020 ◽

Author(s):

P. Pankayaraj ◽

D. H. S. Maithripala

Keyword(s):

Communication Policy ◽

Bandit Problems ◽

Multi Agent

Download Full-text

MmWave Codebook Selection in Rapidly-Varying Channels via Multinomial Thompson Sampling

Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing ◽

10.1145/3466772.3467044 ◽

2021 ◽

Author(s):

Yi Zhang ◽

Soumya Basu ◽

Sanjay Shakkottai ◽

Robert W. Heath

Keyword(s):

Thompson Sampling

Download Full-text

On transforming an index for generalised bandit problems

Journal of Applied Probability ◽

10.2307/3214927 ◽

1995 ◽

Vol 32 (1) ◽

pp. 168-182 ◽

Cited By ~ 4

Author(s):

K. D. Glazebrook ◽

S. Greatrix

Keyword(s):

Dynamic Programming ◽

Policy Evaluation ◽

Gittins Index ◽

Bandit Problem ◽

Bandit Problems ◽

Index Policies

Nash (1980) demonstrated that index policies are optimal for a class of generalised bandit problem. A transform of the index concerned has many of the attributes of the Gittins index. The transformed index is positive-valued, with maximal values yielding optimal actions. It may be characterised as the value of a restart problem and is hence computable via dynamic programming methodologies. The transformed index can also be used in procedures for policy evaluation.

Download Full-text