Dependency of regret on accuracy of variance estimation for different versions of UCB strategy for Gaussian multi-armed bandits
2021 · Vol 2052 (1) · pp. 012013
Abstract We consider two variations of the upper confidence bound (UCB) strategy for Gaussian two-armed bandits. The rewards of the arms are assumed to have unknown expected values and unknown variances. It is demonstrated that the expected regret of both strategies is a continuous function of the reward variance. A set of Monte Carlo simulations was performed to show the nature of the relation between variance estimation accuracy and losses. It is shown that the regret grows only slightly even when the estimation error is fairly large, which makes it possible to estimate the variance during the initial steps of the control and to stop the estimation afterwards.
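The idea described in the abstract — estimating the reward variance only during the initial pulls and then freezing the estimate inside the UCB index — can be sketched as follows. This is a minimal illustration, not the paper's exact strategy: the function name, the index formula sqrt(2 * var * log(t) / n), and the `est_steps` cutoff are all assumptions made for the example.

```python
import math
import random

def ucb_two_armed_gaussian(means, sigmas, horizon, est_steps, seed=0):
    """UCB-style policy on a two-armed Gaussian bandit with unknown variance.

    The variance of each arm is estimated from its samples only while the
    arm has been pulled at most `est_steps` times; after that the estimate
    is frozen, mirroring the early-estimation idea from the abstract.
    Returns the cumulative (pseudo-)regret over `horizon` steps.
    """
    rng = random.Random(seed)
    counts = [0, 0]          # pulls per arm
    sums = [0.0, 0.0]        # running sums of rewards
    sumsq = [0.0, 0.0]       # running sums of squared rewards
    var_est = [1.0, 1.0]     # variance estimates (frozen after est_steps)
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= 2:
            a = t - 1        # pull each arm once to initialize
        else:
            indices = []
            for i in (0, 1):
                bonus = math.sqrt(2.0 * var_est[i] * math.log(t) / counts[i])
                indices.append(sums[i] / counts[i] + bonus)
            a = 0 if indices[0] >= indices[1] else 1
        r = rng.gauss(means[a], sigmas[a])
        counts[a] += 1
        sums[a] += r
        sumsq[a] += r * r
        # update the variance estimate only during the initial pulls
        if 2 <= counts[a] <= est_steps:
            m = sums[a] / counts[a]
            var_est[a] = max(sumsq[a] / counts[a] - m * m, 1e-6)
        regret += best - means[a]
    return regret
```

A Monte Carlo study like the one the abstract mentions would average this regret over many seeds, varying `est_steps` (and hence the estimation error) to see how losses respond.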