On transforming an index for generalised bandit problems

1995 ◽  
Vol 32 (1) ◽  
pp. 168-182 ◽  
Author(s):  
K. D. Glazebrook ◽  
S. Greatrix

Nash (1980) demonstrated that index policies are optimal for a class of generalised bandit problems. A transform of the index concerned has many of the attributes of the Gittins index. The transformed index is positive-valued, with maximal values yielding optimal actions. It may be characterised as the value of a restart problem and is hence computable via dynamic programming methods. The transformed index can also be used in procedures for policy evaluation.
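Since the abstract characterises the index as the value of a restart problem computable by dynamic programming, a minimal sketch of that style of computation may help. The sketch below computes the ordinary Gittins index via the classical restart-in-state formulation of Katehakis and Veinott (1987), not Nash's generalised transform itself; `P`, `r`, `beta`, and state `s` are assumed inputs for a finite-state arm.

```python
import numpy as np

def gittins_index_restart(P, r, beta, s, tol=1e-10, max_iter=100_000):
    """Gittins index of state s via the restart-in-s problem:
    each step offers 'continue' from the current state or 'restart'
    from state s; the index equals (1 - beta) * V(s)."""
    n = len(r)
    V = np.zeros(n)
    for _ in range(max_iter):
        continue_val = r + beta * (P @ V)        # keep playing from each state
        restart_val = r[s] + beta * (P[s] @ V)   # scalar: jump back to s first
        V_new = np.maximum(continue_val, restart_val)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[s]
```

The identity follows because the restart value satisfies V(s) = sup over stopping times of E[discounted reward]/(1 - E[beta^tau]), which rescales the Gittins ratio by 1/(1 - beta).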


2008 ◽  
Vol 40 (2) ◽  
pp. 377-400 ◽  
Author(s):  
Savas Dayanik ◽  
Warren Powell ◽  
Kazutoshi Yamazaki

A multiarmed bandit problem is studied when the arms are not always available. The arms are first assumed to be intermittently available with some state/action-dependent probabilities. It is proven that no index policy can attain the maximum expected total discounted reward in every instance of that problem. The Whittle index policy is derived, and its properties are studied. Then it is assumed that the arms may break down, but repair is an option at some cost, and the new Whittle index policy is derived. Both problems are indexable. The proposed index policies cannot be dominated by any other index policy over all multiarmed bandit problems considered here. Whittle indices are evaluated for Bernoulli arms with unknown success probabilities.
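For intuition about how a Whittle index is computed in general: it is the passive subsidy at which a given state is indifferent between activity and passivity. The sketch below is a generic numerical scheme, not the paper's derivation; the arm model (`P_act`, `P_pas`, `r_act`, `r_pas`) is a hypothetical finite-state input and indexability is assumed.

```python
import numpy as np

def solve_arm(P_act, P_pas, r_act, r_pas, lam, beta, tol=1e-9):
    """Value iteration for one restless arm when passivity earns subsidy lam;
    returns the converged action values at every state."""
    V = np.zeros(len(r_act))
    while True:
        Q_act = r_act + beta * (P_act @ V)
        Q_pas = r_pas + lam + beta * (P_pas @ V)
        V_new = np.maximum(Q_act, Q_pas)
        if np.max(np.abs(V_new - V)) < tol:
            return Q_act, Q_pas
        V = V_new

def whittle_index(P_act, P_pas, r_act, r_pas, x, beta,
                  lo=-100.0, hi=100.0, iters=60):
    """Bisect on the passive subsidy until state x is indifferent between
    active and passive; assumes the arm is indexable."""
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        Q_act, Q_pas = solve_arm(P_act, P_pas, r_act, r_pas, lam, beta)
        if Q_act[x] > Q_pas[x]:
            lo = lam   # subsidy too small: still worth activating
        else:
            hi = lam
    return 0.5 * (lo + hi)
```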




2006 ◽  
Vol 38 (3) ◽  
pp. 643-672 ◽  
Author(s):  
K. D. Glazebrook ◽  
D. Ruiz-Hernandez ◽  
C. Kirkbride

In 1988 Whittle introduced an important but intractable class of restless bandit problems which generalise the multiarmed bandit problems of Gittins by allowing state evolution for passive projects. Whittle's account deployed a Lagrangian relaxation of the optimisation problem to develop an index heuristic. Despite a developing body of evidence (both theoretical and empirical) which underscores the strong performance of Whittle's index policy, a continuing challenge to implementation is the need to establish that the competing projects all pass an indexability test. In this paper we employ Gittins' index theory to establish the indexability of (inter alia) general families of restless bandits which arise in problems of machine maintenance and in stochastic scheduling with switching penalties. We also give formulae for the resulting Whittle indices. Numerical investigations testify to the outstanding performance of the index heuristics concerned.
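As a usage example of the `whittle_index` sketch above, here is an entirely hypothetical three-state machine-maintenance arm of the kind this family covers: passivity runs a deteriorating machine, activity repairs it at a cost.

```python
import numpy as np

# Hypothetical deterioration model with states (good, worn, failed).
P_pas = np.array([[0.7, 0.3, 0.0],    # running: the machine degrades
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])   # failed stays failed unless repaired
P_act = np.array([[1.0, 0.0, 0.0],    # repairing: reset toward 'good'
                  [0.9, 0.1, 0.0],
                  [0.8, 0.2, 0.0]])
r_pas = np.array([1.0, 0.5, 0.0])     # output while the machine runs
r_act = r_pas - 0.6                   # assumed repair cost of 0.6 per period

for x in range(3):
    print(x, whittle_index(P_act, P_pas, r_act, r_pas, x, beta=0.9))
```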


2016 ◽  
Vol 31 (2) ◽  
pp. 239-263 ◽  
Author(s):  
James Edwards ◽  
Paul Fearnhead ◽  
Kevin Glazebrook

The knowledge gradient (KG) policy was originally proposed for offline ranking and selection problems but has recently been adapted for online decision-making in general and multi-armed bandit problems (MABs) in particular. We study its use in a class of exponential family MABs and identify weaknesses, including a propensity to take actions which are dominated with respect to both exploitation and exploration. We propose variants of KG which avoid such errors. These new policies include an index heuristic, which deploys a KG approach to develop an approximation to the Gittins index. A numerical study shows this policy to perform well over a range of MABs, including those for which index policies are not optimal. While KG does not take dominated actions when bandits are Gaussian, it fails to be index consistent and, for correlated arms, appears not to enjoy a performance advantage over competitor policies sufficient to compensate for its greater computational demands.
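For readers unfamiliar with KG mechanics, a minimal one-step KG score for Bernoulli arms with Beta posteriors is sketched below; the `horizon_weight` multiplier (how much future value one observation carries) is an assumed stand-in for the online calibration, not the paper's exact policy.

```python
import numpy as np

def kg_scores(alpha, beta_, horizon_weight):
    """One-step knowledge gradient for Bernoulli arms, Beta(a, b) posteriors.
    KG_i = E[max_j mu_j after a hypothetical pull of i] - max_j mu_j;
    the policy pulls argmax of mu_i + horizon_weight * KG_i."""
    mu = alpha / (alpha + beta_)
    best = mu.max()
    scores = np.empty_like(mu)
    for i in range(len(mu)):
        others = np.delete(mu, i).max() if len(mu) > 1 else -np.inf
        up = (alpha[i] + 1) / (alpha[i] + beta_[i] + 1)   # mean if success
        down = alpha[i] / (alpha[i] + beta_[i] + 1)       # mean if failure
        kg = mu[i] * max(up, others) + (1 - mu[i]) * max(down, others) - best
        scores[i] = mu[i] + horizon_weight * kg
    return scores
```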


2020 ◽  
Author(s):  
Daniel Russo

This note gives a short, self-contained proof of a sharp connection between Gittins indices and Bayesian upper confidence bound algorithms. I consider a Gaussian multiarmed bandit problem with discount factor γ. The Gittins index of an arm is shown to equal the γ-quantile of the posterior distribution of the arm's mean, plus an error term that vanishes as γ → 1. In this sense, for sufficiently patient agents, a Gittins index measures the highest plausible mean reward of an arm in a manner equivalent to an upper confidence bound.
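Operationally, the rule the note describes is one line for a Gaussian posterior (assuming, per the reconstruction above, that the quantile level equals the discount factor γ):

```python
from scipy.stats import norm

def quantile_index(mu, sigma, gamma):
    """gamma-quantile of a N(mu, sigma^2) posterior on the arm's mean;
    per the note, this approximates the Gittins index as gamma -> 1."""
    return mu + sigma * norm.ppf(gamma)
```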


1988 ◽  
Vol 20 (2) ◽  
pp. 447-472 ◽  
Author(s):  
Tze Leung Lai ◽  
Zhiliang Ying

Asymptotic approximations are developed herein for the optimal policies in discounted multi-armed bandit problems in which new projects are continually appearing, commonly known as ‘open bandit problems’ or ‘arm-acquiring bandits’. It is shown that under certain stability assumptions the open bandit problem is asymptotically equivalent, as the discount factor approaches 1, to a closed bandit problem in which there is no arrival of new projects. Applications of these results to optimal scheduling of queueing networks are given. In particular, Klimov's priority indices for scheduling queueing networks are shown to be limits of the Gittins indices for the associated closed bandit problem, and extensions of Klimov's results to preemptive policies and to unstable queueing systems are given.
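In the special case of a multiclass queue without feedback routing, Klimov's priority index reduces to the classical cμ rule, which is simple to state in code (the example class data is illustrative):

```python
def cmu_priorities(holding_costs, service_rates):
    """c-mu rule: serve classes in decreasing c_k * mu_k,
    the no-feedback special case of Klimov's index."""
    return sorted(range(len(holding_costs)),
                  key=lambda k: holding_costs[k] * service_rates[k],
                  reverse=True)

# e.g. costs (3, 1, 2) with rates (1.0, 4.0, 2.5) give priorities [2, 1, 0]
print(cmu_priorities((3, 1, 2), (1.0, 4.0, 2.5)))
```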


2002 ◽  
Vol 34 (4) ◽  
pp. 754-774 ◽  
Author(s):  
K. D. Glazebrook ◽  
J. Niño-Mora ◽  
P. S. Ansell

The paper concerns a class of discounted restless bandit problems which possess an indexability property. Conservation laws yield an expression for the reward suboptimality of a general policy. These results are utilised to study the closeness to optimality of an index policy for a special class of simple and natural dual-speed restless bandits for which indexability is guaranteed. The strong performance of the index policy is confirmed by a computational study.
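The suboptimality expression compares a policy's achieved reward with the optimum. For a fixed stationary policy on a finite model, the discounted reward solves a single linear system; a generic sketch (with hypothetical inputs `P_pi`, `r_pi` induced by the policy) is:

```python
import numpy as np

def policy_value(P_pi, r_pi, beta):
    """Exact discounted policy evaluation: solve (I - beta * P_pi) V = r_pi.
    Subtracting this from the optimal value gives the reward
    suboptimality that conservation-law bounds of this kind control."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)
```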


Author(s):  
David Simchi-Levi ◽  
Yunzong Xu

We consider the general (stochastic) contextual bandit problem under the realizability assumption, that is, the expected reward, as a function of contexts and actions, belongs to a general function class F. We design a fast and simple algorithm that achieves the statistically optimal regret with only O(log T) calls to an offline regression oracle across all T rounds. The number of oracle calls can be further reduced to O(log log T) if T is known in advance. Our results provide the first universal and optimal reduction from contextual bandits to offline regression, solving an important open problem in the contextual bandit literature. A direct consequence of our results is that any advances in offline regression immediately translate to contextual bandits, statistically and computationally. This leads to faster algorithms and improved regret guarantees for broader classes of contextual bandit problems.
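The computational heart of such oracle-based methods is converting the oracle's reward predictions into a sampling distribution over actions. A common scheme for this step, inverse-gap weighting (which, to my understanding, underlies the algorithm described), is sketched below; `gamma` is a tuning parameter, hypothetically chosen.

```python
import numpy as np

def inverse_gap_weighting(predicted_rewards, gamma):
    """Map oracle predictions (np.ndarray of length K) to action
    probabilities: each non-greedy action a gets 1 / (K + gamma * gap(a)),
    where gap(a) is its predicted shortfall to the greedy action,
    and the greedy action absorbs the remaining mass."""
    K = len(predicted_rewards)
    best = int(np.argmax(predicted_rewards))
    gaps = predicted_rewards[best] - predicted_rewards
    p = 1.0 / (K + gamma * gaps)
    p[best] = 0.0
    p[best] = 1.0 - p.sum()   # valid: the other entries sum to < 1
    return p
```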

