AdaLinUCB: Opportunistic Learning for Contextual Bandits

Author(s):  
Xueying Guo ◽  
Xiaoxiao Wang ◽  
Xin Liu

In this paper, we propose and study opportunistic contextual bandits - a special case of contextual bandits where the exploration cost varies under different environmental conditions, such as network load or return variation in recommendations. When the exploration cost is low, so is the actual regret of pulling a sub-optimal arm (e.g., trying a suboptimal recommendation). Intuitively, then, we could explore more when the exploration cost is relatively low and exploit more when it is relatively high. Inspired by this intuition, for opportunistic contextual bandits with linear payoffs, we propose an Adaptive Upper-Confidence-Bound algorithm (AdaLinUCB) to adaptively balance the exploration-exploitation trade-off for opportunistic learning. We prove that AdaLinUCB achieves an O((log T)^2) problem-dependent regret upper bound, which has a smaller coefficient than that of the traditional LinUCB algorithm. Moreover, on both synthetic and real-world datasets, we show that AdaLinUCB significantly outperforms other contextual bandit algorithms under large exploration cost fluctuations.
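
The opportunistic idea can be sketched in a few lines: a standard LinUCB index whose exploration width shrinks when exploration is costly. This is a minimal illustration, not the paper's exact algorithm; the load signal `load_t`, the linear scaling rule, and all names below are assumptions for exposition.

```python
import numpy as np

class AdaptiveLinUCB:
    """LinUCB whose exploration width shrinks when exploration is costly.

    Sketch of the opportunistic idea: alpha_t = alpha * f(load_t), where f
    maps high load (high exploration cost) to little or no exploration.
    """

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.alpha = alpha
        self.A = lam * np.eye(d)   # ridge-regularized Gram matrix
        self.b = np.zeros(d)       # accumulated reward-weighted contexts

    def choose(self, contexts, load_t):
        # Scale exploration by (1 - load): explore when the cost is low
        # (an assumed scaling, not the paper's exact schedule).
        alpha_t = self.alpha * max(0.0, 1.0 - load_t)
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        scores = [x @ theta + alpha_t * np.sqrt(x @ A_inv @ x) for x in contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```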

Author(s):  
Julian Berk ◽  
Sunil Gupta ◽  
Santu Rana ◽  
Svetha Venkatesh

In order to improve the performance of Bayesian optimisation, we develop a modified Gaussian process upper confidence bound (GP-UCB) acquisition function. This is done by sampling the exploration-exploitation trade-off parameter from a distribution. We prove that this allows the expected trade-off parameter to be altered to better suit the problem without compromising a bound on the function's Bayesian regret. We also provide results showing that our method achieves better performance than GP-UCB in a range of real-world and synthetic problems.
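
The modification can be read as follows: instead of a fixed trade-off parameter in the acquisition mu(x) + sqrt(beta) * sigma(x), draw beta from a distribution at every iteration. The sketch below uses scikit-learn's GaussianProcessRegressor, and the exponential sampling distribution is an illustrative assumption, not necessarily the distribution analyzed in the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def sampled_ucb_acquisition(gp, candidates, rng, scale=1.0):
    """GP-UCB acquisition with a randomly sampled trade-off parameter.

    Illustrative sketch: beta is drawn fresh on each call (here from an
    exponential distribution, an assumed choice) instead of being fixed.
    """
    mu, sigma = gp.predict(candidates, return_std=True)
    beta = rng.exponential(scale)          # sampled exploration weight
    scores = mu + np.sqrt(beta) * sigma
    return candidates[np.argmax(scores)]

# usage: fit on the observed points, then pick the next query point
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 1))
y = np.sin(3 * X).ravel()
gp = GaussianProcessRegressor().fit(X, y)
x_next = sampled_ucb_acquisition(gp, np.linspace(-1, 1, 200)[:, None], rng)
```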


2020 ◽  
Vol 34 (04) ◽  
pp. 7023-7030
Author(s):  
Jinhang Zuo ◽  
Xiaoxi Zhang ◽  
Carlee Joe-Wong

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance to play the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for K arms with Bernoulli rewards, and prove a T-round regret upper bound of O(K^2 log T). In the multi-player setting, collisions occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove that its T-round regret relative to an offline greedy strategy is upper bounded by O((K^4/M^2) log T) for K arms and M players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, achieving logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
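
A single round of the observe-before-play idea can be sketched as follows. The number of pre-observations `m`, the "play the first arm whose pre-observation succeeded" rule, and the per-observation cost handling are simplified stand-ins for the paper's actual policy.

```python
import numpy as np

def obp_round(counts, means, t, m, cost, true_p, rng):
    """One round of a simplified observe-before-play policy (sketch only).

    Pre-observe the m arms with the highest UCB indices (m and the rule
    below are assumptions, not the paper's exact policy), then play an
    arm whose pre-observation came up 1, if any.
    """
    ucb = means + np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    order = np.argsort(-ucb)[:m]             # arms to pre-observe
    observed = {a: rng.binomial(1, true_p[a]) for a in order}
    # play the first pre-observed arm that showed reward 1, else the top arm
    good = [a for a in order if observed[a] == 1]
    arm = good[0] if good else order[0]
    reward = observed[arm]
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
    return reward - cost * m                 # net payoff of the round
```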


Author(s):  
Ruida Zhou ◽  
Chao Gan ◽  
Jing Yang ◽  
Cong Shen

In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback, by considering the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially, until a certain stopping condition is satisfied. Our objective is then to maximize the expected net reward in each step, i.e., the reward obtained in each step minus the total cost incurred in examining the items, by deciding the ordered list of items as well as when to stop examination. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm, and show that the cumulative regret scales as $O(\log T)$. We also provide a lower bound for all $\alpha$-consistent policies, which scales as $\Omega(\log T)$ and matches our upper bound. The performance of the CC-UCB algorithm is evaluated on both synthetic and real-world data.
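
The online loop can be sketched as below: rank items by a UCB index, examine in that order while the optimistic net value stays positive, and stop at the first success. The index constant, the known-cost assumption, and the first-success stopping rule are illustrative simplifications of CC-UCB, not its exact specification.

```python
import numpy as np

def cc_ucb_step(counts, means, costs, t, rng, true_theta):
    """One step of a simplified cost-aware cascading policy (sketch).

    Items are ranked by a UCB index on their success probability; each
    examination pays that item's (known, assumed) cost, and examination
    stops at the first success. The real CC-UCB differs in detail.
    """
    ucb = means + np.sqrt(1.5 * np.log(t + 1) / np.maximum(counts, 1))
    # examine only items whose optimistic net value is positive
    order = [k for k in np.argsort(-ucb) if ucb[k] > costs[k]]
    net = 0.0
    for k in order:
        net -= costs[k]                       # pay the examination cost
        state = rng.binomial(1, true_theta[k])
        counts[k] += 1
        means[k] += (state - means[k]) / counts[k]
        if state == 1:                        # stopping condition: first success
            net += 1.0
            break
    return net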


Author(s):  
Yi-Qi Hu ◽  
Yang Yu ◽  
Jun-Da Liao

An automatic machine learning (AutoML) task is to select the best algorithm and its hyper-parameters simultaneously. Previously, the hyper-parameters of all algorithms were joined into a single search space, which is not only huge but also redundant, because many hyper-parameter dimensions are irrelevant to the algorithm actually selected. In this paper, we propose a cascaded approach for algorithm selection and hyper-parameter optimization. While a search procedure is employed at the level of hyper-parameter optimization, a bandit strategy runs at the level of algorithm selection to allocate the budget based on the search feedback. Since the bandit is required to select the algorithm with the maximum performance, instead of the average performance, we propose the extreme-region upper confidence bound (ER-UCB) strategy, which focuses on the extreme region of the underlying feedback distribution. We show theoretically that ER-UCB has a regret upper bound of O(K ln n) with independent feedback, which is as efficient as the classical UCB bandit. We also conduct experiments on a synthetic problem as well as a set of AutoML tasks. The results verify the effectiveness of the proposed method.
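
The bandit layer can be sketched as follows. The index below (mean plus an inflated deviation term plus an exploration bonus) is a stand-in that captures the "favor heavy upper tails" intent; the paper's exact ER-UCB formula and the `theta` weight are not reproduced here.

```python
import numpy as np

def er_ucb_select(history, t, theta=2.0):
    """Pick the algorithm whose *extreme* feedback region looks best.

    Sketch: score each algorithm by mean + theta * std + exploration bonus,
    so arms with occasional excellent trials win budget even if their
    average is mediocre. theta and the bonus form are assumptions.
    """
    scores = []
    for rewards in history:                  # one list of scores per algorithm
        n = max(len(rewards), 1)
        mu = np.mean(rewards) if rewards else 0.0
        sd = np.std(rewards) if len(rewards) > 1 else 1.0
        scores.append(mu + theta * sd + np.sqrt(2 * np.log(t + 1) / n))
    return int(np.argmax(scores))
```

In use, each round allocates one hyper-parameter trial to the selected algorithm and appends the resulting validation score to that algorithm's history, so budget concentrates on algorithms whose best configurations look strongest.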


Author(s):  
Hamsa Bastani ◽  
Mohsen Bayati ◽  
Khashayar Khosravi

The contextual bandit literature has traditionally focused on algorithms that address the exploration–exploitation tradeoff. In particular, greedy algorithms that exploit current estimates without any exploration may be suboptimal in general. However, exploration-free greedy algorithms are desirable in practical settings where exploration may be costly or unethical (e.g., clinical trials). Surprisingly, we find that a simple greedy algorithm can be rate optimal (achieves asymptotically optimal regret) if there is sufficient randomness in the observed contexts (covariates). We prove that this is always the case for a two-armed bandit under a general class of context distributions that satisfy a condition we term covariate diversity. Furthermore, even absent this condition, we show that a greedy algorithm can be rate optimal with positive probability. Thus, standard bandit algorithms may unnecessarily explore. Motivated by these results, we introduce Greedy-First, a new algorithm that uses only observed contexts and rewards to determine whether to follow a greedy algorithm or to explore. We prove that this algorithm is rate optimal without any additional assumptions on the context distribution or the number of arms. Extensive simulations demonstrate that Greedy-First successfully reduces exploration and outperforms existing (exploration-based) contextual bandit algorithms such as Thompson sampling or upper confidence bound. This paper was accepted by J. George Shanthikumar, big data analytics.
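
The switching logic can be sketched as below. The diagnostic used here (the minimum eigenvalue of each arm's empirical context covariance must grow linearly in t) is our reading of the covariate-diversity check; the threshold constant, the fallback behavior, and all names are assumptions.

```python
import numpy as np

def greedy_first_step(x, arm_data, t, c0=0.1):
    """One decision of a Greedy-First-style rule (illustrative sketch).

    arm_data: per-arm (X, y) arrays of observed contexts and rewards.
    Assumed check: stay greedy while every arm's empirical covariance has
    minimum eigenvalue >= c0 * t; the paper's exact test differs.
    """
    diverse = all(
        len(y) > 0 and np.linalg.eigvalsh(X.T @ X)[0] >= c0 * t
        for X, y in arm_data
    )
    if not diverse:
        return None      # hand control to an exploration-based algorithm
    # greedy: least-squares estimate per arm, play the best predicted arm
    estimates = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in arm_data]
    return int(np.argmax([x @ th for th in estimates]))
```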


Author(s):  
Mark Burgess

Upper confidence bound (UCB) multi-armed bandit algorithms typically rely on concentration inequalities (such as Hoeffding's inequality) to construct the upper confidence bound. Intuitively, the tighter the bound, the more likely the respective arm is judged appropriately for selection. Hence we derive and utilise an optimal inequality. Usually the sample mean (and sometimes the sample variance) of previous rewards is the information used in the bounds that drive the algorithm; but intuitively, the more information taken from the previous rewards, the tighter the bound can be. Hence our inequality explicitly incorporates the value of each and every past reward into the upper-bound expression that drives the method. We show how this UCB method fits into the broader scope of other information-theoretic UCB algorithms, but unlike them it is free from assumptions about the distribution of the data. We conclude by reporting some already-established regret results, and give some numerical simulations to demonstrate the method's effectiveness.
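
The paper derives its own optimal inequality; as a stand-in that likewise uses more than the sample mean, the sketch below computes the well-known empirical Bernstein bound, whose width shrinks with the sample variance. Reward range [0, b] and the log-t confidence schedule are assumptions.

```python
import numpy as np

def empirical_bernstein_ucb(rewards, t, b=1.0):
    """Upper confidence bound that uses more than the sample mean.

    Stand-in for the paper's optimal inequality: the empirical Bernstein
    bound, whose width shrinks when the sample variance is small.
    Assumes rewards lie in [0, b].
    """
    n = len(rewards)
    if n == 0:
        return np.inf                      # force at least one pull per arm
    mean = np.mean(rewards)
    var = np.var(rewards)
    log_t = np.log(max(t, 2))
    return mean + np.sqrt(2 * var * log_t / n) + 3 * b * log_t / n

def select_arm(history, t):
    """Pull the arm with the largest data-dependent upper bound."""
    return int(np.argmax([empirical_bernstein_ucb(r, t) for r in history]))
```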


2020 ◽  
Vol 24 (1) ◽  
pp. 58
Author(s):  
Anwar Hafidzi

This research begins from the radicalism that has become endemic in society, not only in the real world but also across various online social media. The study finds that online radicalism can be countered through prevention as early as possible, using a multicultural religious counselling approach with those exposed to radical ideology. Among the techniques used is promoting an understanding of religious concepts, so that a moderate understanding (wasathiyah) can emerge across the community's different environmental conditions. The method used is library research, with the primary sources being the works and journal articles of Abu Rokhmad, a specialist on terrorism and radicalism. The study finds that radicalization can be halted through a heart-to-heart approach that puts a multicultural, inclusive culture first. It also shows that radical actors will never stop making radical arguments until they are able to understand differing views rooted in Islamic law, the social environment, and the family.

Keywords: radicalism, deradicalization, multiculturalism, culture, religion, moderate.


2019 ◽  
Vol 9 (20) ◽  
pp. 4303 ◽  
Author(s):  
Jaroslav Melesko ◽  
Vitalij Novickij

There is strong support for including formative assessment in learning processes, with the main emphasis on corrective feedback for students. However, traditional testing and Computer Adaptive Testing can be problematic to implement in the classroom. Paper-based tests are logistically inconvenient and hard to personalize, and thus must be longer to accurately assess every student in the classroom. Computer Adaptive Testing can mitigate these problems by making use of Multi-Dimensional Item Response Theory, at the cost of introducing several new problems, the most serious of which are the greater test-creation complexity, due to the need to calibrate the question pool, and the debatable premise that different questions measure one common latent trait. In this paper, a new approach is proposed that models formative assessment as a Multi-Armed Bandit problem and solves it with the Upper Confidence Bound algorithm. The method, in combination with the e-learning paradigm, has the potential to mitigate problems such as question-item calibration and lengthy tests, while providing accurate formative assessment feedback for students. A number of simulation and empirical-data experiments (with 104 students) are carried out to explore and measure the potential of this application, with positive results.
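
The framing can be sketched as follows, assuming arms are skill areas and the reward signal is an incorrect answer, so the assessment keeps probing where weakness (or uncertainty about it) is greatest; the paper's exact reward design may differ.

```python
import numpy as np

def next_question_topic(wrong_counts, asked_counts, t):
    """Pick the next topic to probe, UCB1-style (illustrative sketch).

    Arms are skill areas; the assumed reward is 1 when the student answers
    incorrectly, so a short test spends its questions where they are most
    informative about the student's weaknesses.
    """
    untried = np.where(asked_counts == 0)[0]
    if len(untried):                 # ask at least one question per topic first
        return int(untried[0])
    means = wrong_counts / asked_counts
    bonus = np.sqrt(2 * np.log(t + 1) / asked_counts)
    return int(np.argmax(means + bonus))
```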


Author(s):  
Pulak Sarkar ◽  
Solagna Modak ◽  
Santanu Ray ◽  
Vasista Adupa ◽  
K. Anki Reddy ◽  
...  

Liquid transport through a composite membrane is inversely proportional to the thickness of its separation layer. While scalable fabrication of ultrathin polymer membranes is sought for their commercial exploitation,...


2021 ◽  
pp. 100208
Author(s):  
Mohammed Alshahrani ◽  
Fuxi Zhu ◽  
Soufiana Mekouar ◽  
Mohammed Yahya Alghamdi ◽  
Shichao Liu
