multiarmed bandit
Recently Published Documents

TOTAL DOCUMENTS: 73 (FIVE YEARS: 22)
H-INDEX: 13 (FIVE YEARS: 3)

Information, 2021, Vol. 12 (12), pp. 521
Author(s): Xiaohan Kang, Hong Ri, Mohd Nor Akmal Khalid, Hiroyuki Iida

The attraction of games comes from the fun a player can have while playing. Gambling games based on the variable-ratio schedule from Skinner’s experiments are the most typically addictive games, so it is necessary to clarify why such simple games are so addictive. The multiarmed bandit game is a typical test of the Skinner-box design and is among the most popular games in gambling houses, which makes it a good example to analyze. This article expands the motion-in-mind model to the setting of multiarmed bandit games, quantifying the player’s psychological inclination from simulation data. By relating it to quantified player satisfaction and play comfort, the feeling of expectation is discussed from an energy perspective. Two energies are proposed: player-side energy (Er) and game-side energy (Ei). Their difference, denoted Ed, expresses the player’s psychological gap. Ten settings of the bandit’s mass were simulated, and it was found that an appropriate setting of player confidence (Er) and entry difficulty (Ei) can balance player expectation. The simulation results show that the player has the largest psychological gap when m = 0.3 and m = 0.7, which suggests the player is motivated by not being reconciled to the outcome, and that addiction is likely to occur when m ∈ [0.5, 0.7]. Such an approach can also help developers and educators increase the efficiency of edutainment games and make games more attractive.
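The abstract only names the quantities Er, Ei, and Ed, so the following Python sketch is a loose illustration of an expectation gap on a simulated variable-ratio arm; the payout, the cost per pull, the assumption that the player expects to break even, and the use of the win probability as the parameter m are illustrative choices, not the authors' motion-in-mind model.

```python
import random

def run_arm(win_prob, payout, cost=1.0, pulls=10_000, seed=0):
    """Monte Carlo estimate of the net return per pull of one slot-machine arm."""
    rng = random.Random(seed)
    total = sum(payout if rng.random() < win_prob else 0.0 for _ in range(pulls))
    return total / pulls - cost

# Ten settings of the win probability m, used here as a stand-in for the paper's
# mass parameter. Er is a placeholder for the player's believed net return and
# Ei the empirical game-side return; Ed = Er - Ei is the expectation gap.
for m in [i / 10 for i in range(1, 11)]:
    Er = 0.0                           # assume the player expects to break even
    Ei = run_arm(m, payout=2.0)        # payout chosen so the arm is fair at m = 0.5
    Ed = Er - Ei
    print(f"m={m:.1f}  Ei={Ei:+.3f}  Ed={Ed:+.3f}")
```

In this toy setup Ed simply changes sign at m = 0.5; the paper's specific findings at m = 0.3 and m = 0.7 come from its richer motion-in-mind formulation, not from this sketch.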


Author(s): Gábor Lugosi, Abbas Mehrabian

We study multiplayer stochastic multiarmed bandit problems in which the players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider two feedback models: a model in which the players can observe whether a collision has occurred and a more difficult setup in which no collision information is available. We give the first theoretical guarantees for the second model: an algorithm with a logarithmic regret and an algorithm with a square-root regret that does not depend on the gaps between the means. For the first model, we give the first square-root regret bounds that do not depend on the gaps. Building on these ideas, we also give an algorithm for reaching approximate Nash equilibria quickly in stochastic anticoordination games.
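As a concrete point of reference, here is a minimal Python simulation of the collision model in the first feedback setting, using a simple explore-then-commit ("musical chairs" style) baseline; the arm means, horizon, and strategy are illustrative and this is not one of the algorithms analyzed in the paper.

```python
import random

def play_round(means, choices, rng):
    """Players who choose the same arm collide and receive zero reward."""
    results = []
    for arm in choices:
        collided = choices.count(arm) > 1
        reward = 0.0 if collided else float(rng.random() < means[arm])
        results.append((reward, collided))
    return results

# Two players, three Bernoulli arms: explore uniformly, then settle on distinct
# top arms ranked by estimated mean. Collision feedback is assumed observable.
rng = random.Random(1)
means = [0.9, 0.6, 0.2]
n_players, horizon, explore = 2, 2_000, 300
sums = [0.0] * len(means)
counts = [0] * len(means)
seats, total = None, 0.0

for t in range(horizon):
    if t < explore:
        choices = [rng.randrange(len(means)) for _ in range(n_players)]
    else:
        if seats is None:
            ranking = sorted(range(len(means)),
                             key=lambda a: -(sums[a] / max(counts[a], 1)))
            seats = ranking[:n_players]      # distinct arms, so no further collisions
        choices = seats
    for arm, (reward, collided) in zip(choices, play_round(means, choices, rng)):
        if not collided:                      # collisions carry no reward information
            sums[arm] += reward
            counts[arm] += 1
        total += reward

print(f"average reward per round: {total / horizon:.2f}")
```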


2021
Author(s): Yining Wang, Boxiao Chen, David Simchi-Levi

We consider a single-product dynamic pricing problem with demand learning. The candidate prices can lie anywhere in a wide price interval, and the modeling of the demand function is nonparametric in nature, imposing only smoothness regularity conditions. One important aspect of our model is that the expected reward function may be nonconcave and indeed multimodal, which leads to many conceptual and technical challenges. Our proposed algorithm is inspired both by the Upper-Confidence-Bound algorithm for the multiarmed bandit and by the Optimism-in-the-Face-of-Uncertainty principle arising from linear contextual bandits. The multiarmed bandit formulation arises from a local-bin approximation of the unknown continuous demand function, and the linear contextual bandit formulation is then applied to obtain more accurate local polynomial approximators within each bin. Through rigorous regret analysis, we demonstrate that our proposed algorithm achieves optimal worst-case regret over a wide range of smooth function classes. More specifically, for k-times smooth functions and T selling periods, the regret of our proposed algorithm is [Formula: see text], which is shown to be optimal via the development of information-theoretic lower bounds. We also show that in special cases, such as strongly concave or infinitely smooth reward functions, our algorithm achieves an [Formula: see text] regret, matching the optimal regret established in previous works. Finally, we present computational results that verify the effectiveness of our method in numerical simulations. This paper was accepted by J. George Shanthikumar, big data analytics.
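To make the local-bin bandit view concrete, the sketch below discretizes the price interval and runs plain UCB1 on the resulting arms; the demand curve is a made-up stand-in, and the paper's local polynomial (linear-contextual) refinement within each bin is not implemented here.

```python
import math
import random

def demand(price):
    """Purchase probability at a given price; unknown to the seller, smooth by assumption."""
    return max(0.0, 1.0 - 0.8 * price + 0.15 * math.sin(4 * price))

# Plain UCB1 over a fixed grid of candidate prices (per-period revenue lies in
# [0, 1] since prices do). Only the bin-level bandit is shown.
rng = random.Random(0)
prices = [0.1 * k for k in range(1, 11)]
counts = [0] * len(prices)
revenue = [0.0] * len(prices)

T = 20_000
for t in range(1, T + 1):
    if 0 in counts:
        i = counts.index(0)                    # try every candidate price once
    else:
        i = max(range(len(prices)),
                key=lambda j: revenue[j] / counts[j]
                + math.sqrt(2 * math.log(t) / counts[j]))
    sold = rng.random() < demand(prices[i])
    revenue[i] += prices[i] if sold else 0.0
    counts[i] += 1

best = max(range(len(prices)), key=lambda j: revenue[j] / counts[j])
print(f"estimated revenue-maximizing price: {prices[best]:.1f}")
```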


2021
Author(s): Ramesh Johari, Vijay Kamble, Yash Kanoria

Platforms face a cold start problem whenever new users arrive: the platform must learn the attributes of new users (explore) in order to match them better in the future (exploit). How should a platform handle cold starts when the items being recommended are available only in limited quantities? For instance, how should a labor market platform match workers to jobs over the lifetime of each worker, given a limited supply of jobs? In this setting, there is one multiarmed bandit problem for each worker, and these problems are coupled together by the constrained supply of jobs of different types. A solution to this problem is developed. It is found that the platform should estimate a shadow price for each job type and, for each worker, adjust payoffs by these prices in order to (i) balance learning with payoffs early on and (ii) match the worker myopically thereafter.
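A rough Python illustration of the shadow-price adjustment follows. The job types, fit probabilities, and shadow prices are hard-coded stand-ins rather than outputs of the paper's capacity computation, and "explore briefly, then match myopically on price-adjusted estimated payoffs" is a simplified reading of the two-phase policy.

```python
import random

rng = random.Random(2)
job_types = ["design", "coding", "writing"]                    # hypothetical job types
shadow_price = {"design": 0.30, "coding": 0.50, "writing": 0.10}   # stand-in prices
true_fit = {"design": 0.55, "coding": 0.80, "writing": 0.40}   # worker quality, unknown to platform

est = {j: [0.0, 0] for j in job_types}    # per-type (sum of observed payoffs, count)
explore_rounds, horizon = 30, 300
net_value = 0.0

for t in range(horizon):
    if t < explore_rounds:
        job = rng.choice(job_types)        # learn the worker's attributes early on
    else:
        # thereafter match myopically on price-adjusted estimated payoff
        job = max(job_types,
                  key=lambda j: est[j][0] / max(est[j][1], 1) - shadow_price[j])
    payoff = float(rng.random() < true_fit[job])
    est[job][0] += payoff
    est[job][1] += 1
    net_value += payoff - shadow_price[job]

print(f"net value accrued over the worker's lifetime: {net_value:.1f}")
```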


2021, Vol. 66 (1), pp. 476-478
Author(s): Paul Reverdy, Vaibhav Srivastava, Naomi Ehrich Leonard

2021, Vol. 59 (6), pp. 4666-4688
Author(s): Wenqing Bao, Xiaoqiang Cai, Xianyi Wu

2020, Vol. 2020, pp. 1-10
Author(s): Jyh-Yih Hsu, Wei-Kuo Tseng, Jia-You Hsieh, Chao-Jen Chang, Huan Chen

In recent years, sales of agricultural products in Taiwan have shifted to electronic marketing: sales websites recommend products that better match consumer preferences and thereby improve farmers’ income. In the past, A/B testing was used to determine the degree of preference for website designs, but it required a large number of tests for evaluation and could not respond to environmental variables, which made it difficult to predict the actual recommendation in advance. In this study, therefore, a reinforcement learning model combined with different contextual multiarmed bandit algorithms is tested on data sets of different complexity; it performs well as the promoted products change and helps to predict the preferences underlying the promotion model.
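For reference, here is a minimal contextual bandit in the epsilon-greedy style, one of the simplest members of the family such a study would compare; the contexts, products, and click probabilities below are invented purely for illustration and are not the study's data.

```python
import random

rng = random.Random(3)
contexts = ["weekday", "weekend"]            # hypothetical environmental variable
products = ["mango", "tea", "rice"]          # hypothetical products to recommend
click_prob = {("weekday", "mango"): 0.10, ("weekday", "tea"): 0.25, ("weekday", "rice"): 0.15,
              ("weekend", "mango"): 0.35, ("weekend", "tea"): 0.20, ("weekend", "rice"): 0.10}

eps = 0.1
stats = {(c, p): [0.0, 0] for c in contexts for p in products}   # (reward sum, count)
clicks = 0.0

for t in range(10_000):
    c = rng.choice(contexts)
    if rng.random() < eps:
        p = rng.choice(products)             # explore a random product
    else:                                     # exploit the best product for this context
        p = max(products, key=lambda x: stats[(c, x)][0] / max(stats[(c, x)][1], 1))
    r = float(rng.random() < click_prob[(c, p)])
    stats[(c, p)][0] += r
    stats[(c, p)][1] += 1
    clicks += r

print(f"click-through rate achieved: {clicks / 10_000:.3f}")
```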


2020
Author(s): Daniel Russo

This note gives a short, self-contained proof of a sharp connection between Gittins indices and Bayesian upper confidence bound algorithms. I consider a Gaussian multiarmed bandit problem with discount factor [Formula: see text]. The Gittins index of an arm is shown to equal the [Formula: see text]-quantile of the posterior distribution of the arm's mean plus an error term that vanishes as [Formula: see text]. In this sense, for sufficiently patient agents, a Gittins index measures the highest plausible mean-reward of an arm in a manner equivalent to an upper confidence bound.
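The quantile level itself is hidden behind "[Formula: see text]" in this abstract, so the snippet below only shows the mechanical part of the statement: a Bayesian upper confidence bound computed as a posterior quantile of a Gaussian arm's mean, with an illustrative (not the paper's) mapping from the discount factor to the quantile level.

```python
from statistics import NormalDist

def bayes_ucb_index(post_mean, post_std, q):
    """Posterior q-quantile of a Gaussian arm's mean, i.e. a Bayesian upper confidence bound."""
    return NormalDist(post_mean, post_std).inv_cdf(q)

# Placeholder mapping from discount factor to quantile level, for illustration only;
# the paper's exact relation is what the "[Formula: see text]" terms stand for.
for gamma in [0.9, 0.99, 0.999]:
    q = gamma
    idx = bayes_ucb_index(post_mean=0.4, post_std=0.2, q=q)
    print(f"gamma={gamma}  quantile level={q}  index={idx:.3f}")
```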

