A note on the advantage of context in Thompson sampling

Author(s):  
Michael Byrd ◽  
Ross Darrow
Entropy ◽  
2021 ◽  
Vol 23 (3) ◽  
pp. 380
Author(s):  
Emanuele Cavenaghi ◽  
Gabriele Sottocornola ◽  
Fabio Stella ◽  
Markus Zanker

The Multi-Armed Bandit (MAB) problem has been extensively studied in order to address real-world challenges related to sequential decision making. In this setting, an agent selects the best action to be performed at time-step t, based on the past rewards received from the environment. This formulation implicitly assumes that the expected payoff for each action is kept stationary by the environment through time. Nevertheless, in many real-world applications this assumption does not hold and the agent has to face a non-stationary environment, that is, one with a changing reward distribution. Thus, we present a new MAB algorithm, named f-Discounted-Sliding-Window Thompson Sampling (f-dsw TS), for non-stationary environments, that is, when the data stream is affected by concept drift. The f-dsw TS algorithm is based on Thompson Sampling (TS) and exploits a discount factor on the reward history and an arm-related sliding window to counteract concept drift in non-stationary environments. We investigate how to combine these two sources of information, namely the discount factor and the sliding window, by means of an aggregation function f(·). In particular, we propose a pessimistic (f=min), an optimistic (f=max), as well as an averaged (f=mean) version of the f-dsw TS algorithm. A rich set of numerical experiments is performed to evaluate the f-dsw TS algorithm against both stationary and non-stationary state-of-the-art TS baselines. We exploit synthetic environments (both randomly generated and controlled) to test the MAB algorithms under different types of drift, that is, sudden/abrupt, incremental, gradual, and increasing/decreasing drift. Furthermore, we adapt four real-world active learning tasks to our framework: a prediction task on crimes in the city of Baltimore, a classification task on insect species, a recommendation task on local web news, and a time-series analysis on microbial organisms in the tropical air ecosystem. The f-dsw TS approach emerges as the best-performing MAB algorithm. At least one of the versions of f-dsw TS performs better than the baselines in synthetic environments, demonstrating the robustness of f-dsw TS under different concept drift types. Moreover, the pessimistic version (f=min) proves the most effective in all real-world tasks.
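As a rough illustration of the mechanism sketched in this abstract, below is a minimal Python sketch of an f-dsw TS-style agent for Bernoulli rewards. It assumes the aggregation function f is applied to samples drawn from a gamma-discounted Beta posterior and a per-arm sliding-window Beta posterior; the class name, priors, and default values are illustrative and not taken from the paper.

```python
import numpy as np
from collections import deque

class FDSWThompsonSampling:
    """Sketch of an f-dsw TS-style agent for Bernoulli rewards.

    Per arm, two Beta posteriors are maintained: one over a gamma-discounted
    reward history and one over a sliding window of the last `window` pulls.
    At each round a sample is drawn from each posterior and the two samples
    are combined with the aggregation function f (min / max / mean)."""

    def __init__(self, n_arms, gamma=0.95, window=100, f=min, seed=0):
        self.rng = np.random.default_rng(seed)
        self.gamma = gamma
        self.f = f
        # Discounted success/failure counts (Beta parameters minus the prior).
        self.disc_s = np.zeros(n_arms)
        self.disc_f = np.zeros(n_arms)
        # Per-arm sliding window of recent rewards.
        self.windows = [deque(maxlen=window) for _ in range(n_arms)]

    def select_arm(self):
        scores = []
        for a, w in enumerate(self.windows):
            disc_sample = self.rng.beta(1 + self.disc_s[a], 1 + self.disc_f[a])
            wins = sum(w)
            sw_sample = self.rng.beta(1 + wins, 1 + len(w) - wins)
            # Aggregate the two samples with f (pessimistic, optimistic, or averaged).
            scores.append(self.f(disc_sample, sw_sample))
        return int(np.argmax(scores))

    def update(self, arm, reward):
        # Discount the whole reward history, then add the new observation.
        self.disc_s *= self.gamma
        self.disc_f *= self.gamma
        self.disc_s[arm] += reward
        self.disc_f[arm] += 1 - reward
        self.windows[arm].append(reward)
```

For example, `FDSWThompsonSampling(n_arms=3, f=max)` gives the optimistic variant, while the averaged variant can be obtained by passing `f=lambda a, b: (a + b) / 2`.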


Author(s):  
Ahmadreza Moradipari ◽  
Sanae Amani ◽  
Mahnoosh Alizadeh ◽  
Christos Thrampoulidis

Author(s):  
Jan Leike ◽  
Tor Lattimore ◽  
Laurent Orseau ◽  
Marcus Hutter

We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value and (2) given a recoverability assumption, regret is sublinear. We conclude with a discussion about optimality in reinforcement learning.
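For illustration only, a toy sketch of the policy under discussion: a posterior is maintained over a countable (here, finite) class of candidate environments, an environment is periodically resampled from it, and the optimal policy for the sampled environment is followed. The candidates below are single-state Bernoulli environments standing in for the general, possibly non-Markovian environments of the paper; all names and parameters are assumptions made for this example.

```python
import numpy as np

def posterior_sampling(candidates, true_idx, horizon, resample_every=50, seed=0):
    """Toy sketch of Thompson sampling over a countable environment class.

    Each candidate environment is represented, for illustration only, as a
    vector of Bernoulli arm means. The agent keeps a posterior over the
    candidates, periodically samples one, follows that sample's optimal
    policy, and updates the posterior by Bayes' rule after every observation.
    """
    rng = np.random.default_rng(seed)
    candidates = [np.asarray(c, dtype=float) for c in candidates]
    posterior = np.full(len(candidates), 1.0 / len(candidates))
    total_reward = 0.0
    for t in range(horizon):
        if t % resample_every == 0:
            sampled = candidates[rng.choice(len(candidates), p=posterior)]
        action = int(np.argmax(sampled))  # optimal policy for the sampled environment
        reward = float(rng.random() < candidates[true_idx][action])
        # Bayes update: weight each candidate by the likelihood of the observation.
        lik = np.array([c[action] if reward else 1 - c[action] for c in candidates])
        posterior = posterior * lik
        posterior /= posterior.sum()
        total_reward += reward
    return total_reward, posterior

envs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
print(posterior_sampling(envs, true_idx=1, horizon=2000))
```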


2020 ◽  
Vol 68 ◽  
pp. 311-364
Author(s):  
Francesco Trovo ◽  
Stefano Paladino ◽  
Marcello Restelli ◽  
Nicola Gatti

Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems in the past decades. However, non-stationary settings -- very common in real-world applications -- have received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for non-stationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing environments. In the former, the reward distributions are constant during sequences of rounds, and their changes may be arbitrary and happen at unknown rounds, while, in the latter, the reward distributions smoothly evolve over rounds according to unknown dynamics. Under mild assumptions, we provide upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both forms of non-stationarity are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the forms of non-stationarity are considered separately, as previously studied in the literature.
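A minimal sketch of the sliding-window idea, assuming Bernoulli rewards: each arm's Beta posterior is rebuilt from the last `window` rounds only, so the agent can track an abruptly changing environment. The window size, priors, and toy environment are illustrative choices, not values from the paper.

```python
import numpy as np

def sw_thompson_sampling(reward_fn, n_arms, horizon, window=200, seed=0):
    """Minimal sliding-window Thompson Sampling sketch for Bernoulli arms.

    Each arm's Beta posterior is computed from the last `window` rounds only,
    so the estimates can follow reward distributions that change over time.
    `reward_fn(arm, t)` supplies the (possibly non-stationary) reward.
    """
    rng = np.random.default_rng(seed)
    history = []  # (arm, reward) pairs, one per round
    total = 0.0
    for t in range(horizon):
        recent = history[-window:]
        samples = []
        for a in range(n_arms):
            wins = sum(r for arm, r in recent if arm == a)
            pulls = sum(1 for arm, _ in recent if arm == a)
            samples.append(rng.beta(1 + wins, 1 + pulls - wins))
        arm = int(np.argmax(samples))
        r = reward_fn(arm, t)
        history.append((arm, r))
        total += r
    return total

# Toy abruptly changing environment: the best arm switches halfway through.
env_rng = np.random.default_rng(1)
def reward_fn(arm, t):
    means = (0.8, 0.2) if t < 5000 else (0.2, 0.8)
    return float(env_rng.random() < means[arm])

print(sw_thompson_sampling(reward_fn, n_arms=2, horizon=10_000))
```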


2019 ◽  
Vol 63 (7) ◽  
pp. 1099-1108
Author(s):  
Salem Ameen ◽  
Sunil Vadera

Abstract The successful application of deep learning has led to increasing expectations of its use in embedded systems. This, in turn, has created the need to find ways of reducing the size of neural networks. Decreasing the size of a neural network requires deciding which weights should be removed without compromising accuracy, which is analogous to the kind of problems addressed by multi-armed bandits (MABs). Hence, this paper explores the use of MABs for reducing the number of parameters of a neural network. Different MAB algorithms, namely $\epsilon$-greedy, win-stay lose-shift, UCB1, KL-UCB, BayesUCB, UGapEb, successive rejects and Thompson sampling, are evaluated and their performance compared to existing approaches. The results show that MAB pruning methods, especially those based on UCB, outperform other pruning methods.
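For illustration only, a toy sketch of how a Thompson sampling bandit might guide pruning decisions: each prunable unit is an arm, and pulling an arm means tentatively removing that unit and observing whether a proxy accuracy drop stays below a threshold. The importance scores, reward definition, and threshold below are hypothetical assumptions, not the paper's experimental setup.

```python
import numpy as np

def ts_prune(weight_importance, n_rounds=500, threshold=0.05, seed=0):
    """Toy sketch of Thompson-sampling-guided pruning.

    Each prunable unit is a bandit arm. Pulling an arm means tentatively
    removing that unit and observing a Bernoulli reward: 1 if the (proxy)
    accuracy drop stays below `threshold`, 0 otherwise. Units whose
    posterior mean of "safe to prune" is high are returned for removal.
    `weight_importance` stands in for a real sensitivity measure.
    """
    rng = np.random.default_rng(seed)
    n = len(weight_importance)
    wins = np.zeros(n)
    losses = np.zeros(n)
    for _ in range(n_rounds):
        arm = int(np.argmax(rng.beta(1 + wins, 1 + losses)))
        # Proxy evaluation: removing an important unit costs more accuracy.
        accuracy_drop = weight_importance[arm] + 0.02 * rng.standard_normal()
        reward = float(accuracy_drop < threshold)
        wins[arm] += reward
        losses[arm] += 1 - reward
    safe_to_prune = (wins / (wins + losses + 1e-9)) > 0.5
    return np.flatnonzero(safe_to_prune)

# Ten units; the last three are clearly important and should be kept.
importance = np.array([0.01] * 7 + [0.3, 0.4, 0.5])
print(ts_prune(importance))
```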

