Thompson Sampling
Recently Published Documents


TOTAL DOCUMENTS: 134 (FIVE YEARS: 90)

H-INDEX: 9 (FIVE YEARS: 2)

2021 ◽  
Author(s):  
Carlos Daniel Pohlod ◽  
Sandra M. Venske ◽  
Carolina P. Almeida

This work proposes a selection Hyper-Heuristic (HH) based on the Thompson Sampling (TS) approach for solving the Quadratic Assignment Problem (QAP). The QAP aims to assign facilities to a set of known candidate locations so as to minimize the total cost of all flows between facilities. The proposed HH is applied to the automatic configuration of a memetic algorithm, selecting a combination of low-level heuristics. Each combination comprises one recombination heuristic, one local search strategy, and one mutation heuristic. The algorithm was evaluated on 15 instances of the Nug benchmark, and the HH outperforms every combination of heuristics applied in isolation, demonstrating its effectiveness for automatic algorithm configuration. The experiments show that TS performance is affected by the quality of the set of low-level heuristics. The best HH version finds the optimal solution on 9 instances, and the mean percentage deviation from the optimal solution (gap) over all 15 instances was 8.6%, with the largest gaps occurring on the three largest instances.
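
As a rough illustration of the selection mechanism described above, the sketch below applies Beta-Bernoulli Thompson Sampling to choose among low-level heuristic combinations. The combination names, the binary improved/not-improved reward, and the apply_heuristics hook are illustrative assumptions, not the paper's exact design.

```python
import random

# Minimal sketch of a Thompson Sampling selection hyper-heuristic.
# Each "arm" is one combination of (recombination, local search, mutation);
# the names below are placeholders, not the paper's heuristic set.
combinations = [
    ("uniform_crossover", "2opt", "swap_mutation"),
    ("order_crossover", "2opt", "insert_mutation"),
    ("uniform_crossover", "3opt", "swap_mutation"),
]

# Beta(1, 1) prior per combination: alpha counts successes, beta failures.
alpha = {c: 1.0 for c in combinations}
beta = {c: 1.0 for c in combinations}

def select_combination():
    """Sample a success probability per arm and play the argmax."""
    return max(combinations, key=lambda c: random.betavariate(alpha[c], beta[c]))

def update(combination, improved):
    """Binary reward: did applying the combination improve the incumbent?"""
    if improved:
        alpha[combination] += 1.0
    else:
        beta[combination] += 1.0

# Inside the memetic algorithm's generation loop one would call:
#   combo = select_combination()
#   improved = apply_heuristics(combo, population)  # problem-specific hook
#   update(combo, improved)
```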


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Suhansanu Kumar ◽  
Heting Gao ◽  
Changyu Wang ◽  
Kevin Chen-Chuan Chang ◽  
Hari Sundaram

Symmetry ◽  
2021 ◽  
Vol 13 (11) ◽  
pp. 2175
Author(s):  
Miguel Martín ◽  
Antonio Jiménez-Martín ◽  
Alfonso Mateos ◽  
Josefa Z. Hernández

A/B testing is used in digital contexts both to offer a more personalized service and to optimize the e-commerce purchasing process. A personalized service provides customers with the fastest possible access to the contents that they are most likely to use. An optimized e-commerce purchasing process reduces customer effort during online purchasing and ensures that as many customers as possible complete their orders. The most widespread A/B testing method is to implement the equivalent of randomized controlled trials (RCTs). Recently, however, some companies and solutions have addressed this experimentation process as a multi-armed bandit (MAB). This is known in the A/B testing market as dynamic traffic distribution. A complementary technique used to optimize the performance of A/B testing is to improve the experiment stopping criterion. In this paper, we propose an adaptation of A/B testing to account for possibilistic reward (PR) methods, together with the definition of a new stopping criterion, also based on PR methods, to be used for both classical A/B testing and A/B testing based on MAB algorithms. A comparative numerical analysis based on the simulation of real scenarios is used to analyze the performance of the proposed adaptations in both Bernoulli and non-Bernoulli environments. In this analysis, we show that the possibilistic reward method PR3 produced the lowest mean cumulative regret in non-Bernoulli environments, with a high confidence level and high stability, as demonstrated by low standard deviations. PR3 behaves exactly the same as Thompson sampling in Bernoulli environments. The conclusion is that PR3 can be used efficiently in both environments, in combination with the value remaining stopping criterion in Bernoulli environments and the PR3 bounds stopping criterion in non-Bernoulli environments.
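
The dynamic traffic distribution idea above maps directly onto Beta-Bernoulli Thompson Sampling. Here is a minimal sketch simulating a two-variant Bernoulli A/B test; the conversion rates and visitor loop are invented for illustration, and the code shows plain Thompson sampling rather than the paper's PR3 method (which the authors report behaves identically in Bernoulli environments).

```python
import numpy as np

# Minimal sketch of dynamic traffic distribution via Thompson Sampling for a
# Bernoulli A/B test (conversion = 1, no conversion = 0). Conversion rates
# below are invented for the simulation.
rng = np.random.default_rng(0)
true_rates = [0.04, 0.05]          # unknown conversion rates of variants A, B
successes = np.ones(2)             # Beta(1, 1) priors per variant
failures = np.ones(2)

for visitor in range(10_000):
    # Sample a plausible conversion rate per variant, route to the best draw.
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))
    reward = rng.random() < true_rates[arm]   # simulated visitor outcome
    successes[arm] += reward
    failures[arm] += 1 - reward

print("posterior means:", successes / (successes + failures))
```

Because each visitor is routed by a fresh posterior draw, traffic shifts toward the better variant as evidence accumulates, which is the behaviour the "dynamic traffic distribution" label describes.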


2021 ◽  
Vol 39 (4) ◽  
pp. 1-29
Author(s):  
Shijun Li ◽  
Wenqiang Lei ◽  
Qingyun Wu ◽  
Xiangnan He ◽  
Peng Jiang ◽  
...  

Static recommendation methods like collaborative filtering suffer from the inherent limitation of performing real-time personalization for cold-start users. Online recommendation, e.g., the multi-armed bandit approach, addresses this limitation by interactively exploring user preferences online and pursuing the exploration-exploitation (EE) trade-off. However, existing bandit-based methods model recommendation actions homogeneously. Specifically, they only consider the items as the arms and are incapable of handling item attributes, which naturally provide interpretable information about a user's current demands and can effectively filter out undesired items. In this work, we consider conversational recommendation for cold-start users, where a system can both ask a user about attributes and recommend items interactively. This important scenario was studied in a recent work [54], which, however, employs a hand-crafted function to decide when to ask about attributes or make recommendations. Such separate modeling of attributes and items makes the effectiveness of the system rely heavily on the choice of the hand-crafted function, introducing fragility into the system. To address this limitation, we seamlessly unify attributes and items in the same arm space and achieve their EE trade-offs automatically using the framework of Thompson Sampling. Our Conversational Thompson Sampling (ConTS) model holistically solves all questions in conversational recommendation by choosing the arm with the maximal reward to play. Extensive experiments on three benchmark datasets show that ConTS outperforms the state-of-the-art methods Conversational UCB (ConUCB) [54] and the Estimation-Action-Reflection model [27] in both success rate and the average number of conversation turns.
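
To make the unified-arm-space idea concrete, here is a minimal sketch in which attribute queries and item recommendations share one arm pool scored against a Gaussian posterior over the user's preference vector. The arm names, embeddings, and the Bayesian linear-regression update are simplified assumptions, not the exact ConTS model.

```python
import numpy as np

# Illustrative sketch of the core idea in ConTS: attribute queries and item
# recommendations live in one arm space, and Thompson Sampling picks whichever
# arm (ask or recommend) has the highest sampled reward.
rng = np.random.default_rng(1)
d = 8                                    # embedding dimension (assumed)
arms = {                                 # unified arm space: ask or recommend
    "ask:color": rng.normal(size=d),
    "ask:brand": rng.normal(size=d),
    "rec:item_42": rng.normal(size=d),
    "rec:item_97": rng.normal(size=d),
}

mu, cov = np.zeros(d), np.eye(d)         # Gaussian posterior on user vector

def choose_arm():
    """Sample a user vector from the posterior; play the best-scoring arm."""
    u = rng.multivariate_normal(mu, cov)
    return max(arms, key=lambda a: arms[a] @ u)

def update(arm, reward, noise_var=1.0):
    """Simplified Bayesian linear-regression update of (mu, cov) on feedback."""
    global mu, cov
    x = arms[arm]
    prec = np.linalg.inv(cov) + np.outer(x, x) / noise_var
    new_cov = np.linalg.inv(prec)
    mu, cov = new_cov @ (np.linalg.inv(cov) @ mu + x * reward / noise_var), new_cov

print("next action:", choose_arm())      # either asks an attribute or recommends
```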


2021 ◽  
Author(s):  
Hamsa Bastani ◽  
David Simchi-Levi ◽  
Ruihao Zhu

We study the problem of learning shared structure across a sequence of dynamic pricing experiments for related products. We consider a practical formulation in which the unknown demand parameters for each product come from an unknown distribution (prior) that is shared across products. We then propose a meta dynamic pricing algorithm that learns this prior online while solving a sequence of Thompson sampling pricing experiments (each with horizon T) for N different products. Our algorithm addresses two challenges: (i) balancing the need to learn the prior (meta-exploration) with the need to leverage the estimated prior to achieve good performance (meta-exploitation) and (ii) accounting for uncertainty in the estimated prior by appropriately “widening” the estimated prior as a function of its estimation error. We introduce a novel prior alignment technique to analyze the regret of Thompson sampling with a misspecified prior, which may be of independent interest. Unlike prior-independent approaches, our algorithm’s meta regret grows sublinearly in N, demonstrating that the price of an unknown prior in Thompson sampling can be negligible in experiment-rich environments (large N). Numerical experiments on synthetic and real auto loan data demonstrate that our algorithm significantly speeds up learning compared with prior-independent algorithms. This paper was accepted by George J. Shanthikumar for the Management Science Special Issue on Data-Driven Analytics.
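
A rough sketch of the prior-widening step follows: the shared prior is estimated from the products seen so far, and its covariance is inflated before seeding the next product's Thompson sampling run. The 1/sqrt(n) widening schedule and the function names are placeholders; the paper widens the prior as a function of its estimation error.

```python
import numpy as np

# Rough sketch of the prior-widening idea (not the paper's exact algorithm):
# estimate the shared prior from finished pricing experiments, then inflate
# its covariance before handing it to the next Thompson sampling instance.
def widened_prior(theta_hats):
    """theta_hats: (n_products, n_params) demand-parameter estimates."""
    n = len(theta_hats)
    mu_hat = np.mean(theta_hats, axis=0)          # estimated prior mean
    cov_hat = np.cov(theta_hats, rowvar=False)    # estimated prior covariance
    widening = 1.0 / np.sqrt(n)                   # placeholder schedule
    return mu_hat, cov_hat + widening * np.eye(len(mu_hat))

# Each new product's pricing experiment seeds Thompson sampling with this
# widened prior instead of an uninformative one.
theta_hats = np.random.default_rng(2).normal(size=(10, 3))  # 10 products, 3 params
mu0, cov0 = widened_prior(theta_hats)
print("widened prior mean:", mu0)
```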


Author(s):  
Emil Carlsson ◽  
Devdatt Dubhashi ◽  
Fredrik D. Johansson

We propose algorithms based on a multi-level Thompson sampling scheme for the stochastic multi-armed bandit and its contextual variant with linear expected rewards, in the setting where arms are clustered. We show, both theoretically and empirically, how exploiting a given cluster structure can significantly improve the regret and computational cost compared to using standard Thompson sampling. In the case of the stochastic multi-armed bandit, we give upper bounds on the expected cumulative regret, showing how it depends on the quality of the clustering. Finally, we perform an empirical evaluation showing that our algorithms perform well compared to previously proposed algorithms for bandits with clustered arms.
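
A minimal sketch of a two-level scheme in this spirit appears below: sample a cluster by Thompson Sampling, then an arm within it, and credit the observed reward to both levels. The Beta-Bernoulli priors, the cluster layout, and the reward-propagation rule are assumptions for illustration, not the authors' exact algorithm.

```python
import random

# Two-level Thompson sampling over clustered Bernoulli arms: level 1 picks a
# cluster, level 2 picks an arm inside it; the reward updates both levels.
clusters = {
    "c1": ["a1", "a2"],
    "c2": ["a3", "a4", "a5"],
}
all_keys = list(clusters) + [a for members in clusters.values() for a in members]
alpha = {k: 1.0 for k in all_keys}   # Beta(1, 1) priors for clusters and arms
beta = dict(alpha)

def ts_pick(candidates):
    """Sample a success probability per candidate and play the argmax."""
    return max(candidates, key=lambda k: random.betavariate(alpha[k], beta[k]))

def step(pull):
    """pull(arm) -> 0/1 reward from the environment."""
    cluster = ts_pick(list(clusters))        # level 1: choose a cluster
    arm = ts_pick(clusters[cluster])         # level 2: choose an arm in it
    r = pull(arm)
    for key in (cluster, arm):               # credit reward to both levels
        alpha[key] += r
        beta[key] += 1 - r

# Example environment: arm "a4" has the highest success rate.
rates = {"a1": 0.2, "a2": 0.3, "a3": 0.25, "a4": 0.7, "a5": 0.1}
for _ in range(2000):
    step(lambda a: int(random.random() < rates[a]))
best = max(rates, key=lambda a: alpha[a] / (alpha[a] + beta[a]))
print("empirically best arm:", best)
```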

