Open bandit processes and optimal scheduling of queueing networks

1988 ◽  
Vol 20 (2) ◽  
pp. 447-472 ◽  
Author(s):  
Tze Leung Lai ◽  
Zhiliang Ying

Asymptotic approximations are developed herein for the optimal policies in discounted multi-armed bandit problems in which new projects are continually appearing, commonly known as ‘open bandit problems’ or ‘arm-acquiring bandits’. It is shown that under certain stability assumptions the open bandit problem is asymptotically equivalent to a closed bandit problem in which there is no arrival of new projects, as the discount factor approaches 1. Applications of these results to optimal scheduling of queueing networks are given. In particular, Klimov's priority indices for scheduling queueing networks are shown to be limits of the Gittins indices for the associated closed bandit problem, and extensions of Klimov's results to preemptive policies and to unstable queueing systems are given.
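
In the special case of a multiclass queue with no feedback routing, Klimov's priority indices reduce to the classical cμ rule. The sketch below is a minimal illustration of that special case with made-up holding-cost and service rates, not the paper's construction.

```python
def cmu_indices(costs, rates):
    """Static priority indices for a multiclass queue without feedback:
    in this special case Klimov's index for class k reduces to the
    classical c-mu rule, costs[k] * rates[k]."""
    return {k: c * m for k, (c, m) in enumerate(zip(costs, rates))}

# Hypothetical holding-cost rates and service rates for three classes.
idx = cmu_indices(costs=[5.0, 1.0, 3.0], rates=[0.5, 2.0, 1.0])
priority = sorted(idx, key=idx.get, reverse=True)
print(priority)  # [2, 0, 1]: always serve the waiting class highest on this list
```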


2020 ◽  
Author(s):  
Daniel Russo

This note gives a short, self-contained proof of a sharp connection between Gittins indices and Bayesian upper confidence bound algorithms. I consider a Gaussian multiarmed bandit problem with discount factor γ. The Gittins index of an arm is shown to equal the γ-quantile of the posterior distribution of the arm's mean plus an error term that vanishes as γ → 1. In this sense, for sufficiently patient agents, a Gittins index measures the highest plausible mean reward of an arm in a manner equivalent to an upper confidence bound.
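
A minimal sketch of the quantile rule above, assuming a Normal(m, s²) posterior over the arm's mean: the γ-quantile is m + s·Φ⁻¹(γ), so the index grows only slowly as the agent becomes more patient. All parameter values are illustrative.

```python
from scipy.stats import norm

def quantile_index(post_mean, post_sd, gamma):
    """gamma-quantile of a Normal(post_mean, post_sd**2) posterior; per
    the abstract, this approximates the arm's Gittins index up to an
    error term that vanishes as gamma -> 1."""
    return post_mean + post_sd * norm.ppf(gamma)

for gamma in (0.9, 0.99, 0.999):
    print(gamma, quantile_index(0.0, 1.0, gamma))
```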


1995 ◽  
Vol 32 (1) ◽  
pp. 168-182 ◽  
Author(s):  
K. D. Glazebrook ◽  
S. Greatrix

Nash (1980) demonstrated that index policies are optimal for a class of generalised bandit problems. A transform of the index concerned has many of the attributes of the Gittins index. The transformed index is positive-valued, with maximal values yielding optimal actions. It may be characterised as the value of a restart problem and is hence computable via dynamic programming methodologies. The transformed index can also be used in procedures for policy evaluation.
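
The restart characterisation makes such an index computable by ordinary value iteration. The sketch below illustrates the idea for a single finite-state arm using the familiar restart-in-state formulation (in the style of Katehakis and Veinott), not the paper's transformed index; P, r, and beta are assumed inputs.

```python
import numpy as np

def restart_index(P, r, beta, i, tol=1e-10, max_iter=100_000):
    """Index of state i via its restart problem: in every state, either
    continue from the current state or restart the chain from state i.
    The index is (1 - beta) times the value of the problem at i."""
    V = np.zeros(len(r))
    for _ in range(max_iter):
        cont = r + beta * P @ V            # continue from each state...
        V_new = np.maximum(cont, cont[i])  # ...or restart from state i
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1 - beta) * V[i]

# Toy two-state arm (hypothetical numbers).
P = np.array([[0.7, 0.3], [0.4, 0.6]])
r = np.array([1.0, 0.2])
print(restart_index(P, r, beta=0.95, i=0))
```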


1995 ◽  
Vol 32 (2) ◽  
pp. 494-507 ◽  
Author(s):  
François Baccelli ◽  
Serguei Foss

This paper focuses on the stability of open queueing systems under stationary ergodic assumptions. It defines a set of conditions, the monotone separable framework, ensuring that the stability region is given by the following saturation rule: ‘saturate’ the queues which are fed by the external arrival stream; look at the ‘intensity’ μ of the departure stream in this saturated system; then stability holds whenever the intensity of the arrival process, say λ, satisfies the condition λ < μ, whereas the network is unstable if λ > μ. Whenever the stability condition is satisfied, it is also shown that certain state variables associated with the network admit a finite stationary regime which is constructed pathwise using a Loynes-type backward argument. This framework involves two main pathwise properties, external monotonicity and separability, which are satisfied by several classical queueing networks. The main tool for the proof of this rule is subadditive ergodic theory. It is shown that, for various problems, the proposed method provides an alternative to the methods based on Harris recurrence and regeneration; this is particularly true in the Markov case, where we show that the distributional assumptions commonly made on service or arrival times so as to ensure Harris recurrence can in fact be relaxed.
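
In the simplest instance of the rule, a single-server queue, the saturated system is a server that never idles, so μ can be estimated from simulated service times alone; the sketch below (exponential services and all parameters assumed for illustration) then applies the λ < μ test.

```python
import random

def saturated_rate(sample_service, n=200_000, seed=1):
    """Estimate the departure intensity mu of the saturated system:
    with an infinite backlog the server never idles, so n jobs depart
    in the time taken by n back-to-back services."""
    rng = random.Random(seed)
    return n / sum(sample_service(rng) for _ in range(n))

mu = saturated_rate(lambda rng: rng.expovariate(1.25))  # mean service 0.8
lam = 1.0                                               # arrival intensity
print(f"mu ~ {mu:.3f}; saturation rule says stable: {lam < mu}")
```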


1997 ◽  
Vol 29 (1) ◽  
pp. 114-137 ◽  
Author(s):  
Linn I. Sennott

This paper studies the expected average cost control problem for discrete-time Markov decision processes with denumerably infinite state spaces. A sequence of finite state space truncations is defined such that the average costs and average optimal policies in the sequence converge to the optimal average cost and an optimal policy in the original process. The theory is illustrated with several examples from the control of discrete-time queueing systems. Numerical results are discussed.
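
As a hedged illustration of the truncation idea (the model and all parameters below are assumed, not taken from the paper), the sketch runs relative value iteration on truncations {0, ..., N} of a discrete-time single-server queue with a two-speed service control; the truncated average costs settle down as N grows, which is the kind of convergence the paper establishes.

```python
import numpy as np

def avg_cost(N, p=0.4, q=(0.45, 0.7), c=(0.0, 0.5), tol=1e-8):
    """Optimal average cost of the truncation {0, ..., N}: each slot an
    arrival occurs w.p. p; action a serves the head job w.p. q[a] at
    cost c[a]; holding cost is x per slot; arrivals that would push the
    state past N are lost (one simple truncation scheme)."""
    V = np.zeros(N + 1)
    while True:
        TV = np.empty_like(V)
        for x in range(N + 1):
            if x == 0:  # empty queue: nothing to serve, no service cost
                TV[x] = p * V[1] + (1 - p) * V[0]
                continue
            TV[x] = min(
                x + ca
                + p * (1 - qa) * V[min(x + 1, N)]
                + (1 - p) * qa * V[x - 1]
                + (p * qa + (1 - p) * (1 - qa)) * V[x]
                for qa, ca in zip(q, c))
        diff = TV - V
        if diff.max() - diff.min() < tol:      # span norm: converged
            return 0.5 * (diff.max() + diff.min())
        V = TV - TV[0]                         # relative value iteration

for N in (10, 20, 40, 80):
    print(N, round(avg_cost(N), 6))
```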


1987 ◽  
Vol 35 (1) ◽  
pp. 121-126 ◽  
Author(s):  
Katsuhisa Ohno ◽  
Kuniyoshi Ichiki

1993 ◽  
Vol 25 (3) ◽  
pp. 585-606 ◽  
Author(s):  
C. Teresa Lam

In this paper, we study the superposition of finitely many Markov renewal processes with countable state spaces. We define the S-Markov renewal equations associated with the superposed process. The solutions of the S-Markov renewal equations are derived and the asymptotic behaviors of these solutions are studied. These results are applied to calculate various characteristics of queueing systems with superposition semi-Markovian arrivals, queueing networks with bulk service, system availability, and continuous superposition remaining and current life processes.
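
The object of study is the merged stream of several component processes. As a toy stand-in (plain renewal streams rather than Markov renewal processes with state), the sketch below merges independent streams in time order and tags each event with its source, the bookkeeping a superposed process requires.

```python
import heapq, random

def superpose(streams, horizon, seed=0):
    """Merge independent renewal streams into one time-ordered event
    list; each stream is a callable rng -> next inter-event time, and
    every output event is tagged with its source stream."""
    rng = random.Random(seed)
    heap = [(s(rng), i) for i, s in enumerate(streams)]
    heapq.heapify(heap)
    events = []
    while heap:
        t, i = heapq.heappop(heap)
        if t > horizon:
            continue          # this stream has no further events in view
        events.append((t, i))
        heapq.heappush(heap, (t + streams[i](rng), i))
    return events

merged = superpose([lambda rng: rng.expovariate(1.0),
                    lambda rng: rng.gammavariate(2.0, 0.5)], horizon=10.0)
print(len(merged), merged[:3])
```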


2000 ◽  
Vol 37 (1) ◽  
pp. 300-305 ◽  
Author(s):  
Mark E. Lewis ◽  
Martin L. Puterman

The use of bias optimality to distinguish among gain optimal policies was recently studied by Haviv and Puterman [1] and extended in Lewis et al. [2]. In [1], upon arrival to an M/M/1 queue, customers offer the gatekeeper a reward R. If accepted, the gatekeeper immediately receives the reward, but is charged a holding cost, c(s), depending on the number of customers in the system. The gatekeeper, whose objective is to ‘maximize’ rewards, must decide whether to admit the customer. If the customer is accepted, the customer joins the queue and awaits service. Haviv and Puterman [1] showed there can be only two Markovian, stationary, deterministic gain optimal policies and that only the policy which uses the larger control limit is bias optimal. This showed the usefulness of bias optimality to distinguish between gain optimal policies. In the same paper, they conjectured that if the gatekeeper receives the reward upon completion of a job instead of upon entry, the bias optimal policy will be the lower control limit. This note confirms that conjecture.
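
A small numerical sketch of the model described above, assuming a linear holding cost c(s) = h·s and illustrative parameters: each control limit L induces an M/M/1/L queue whose stationary law gives the policy's gain. At special values of R two adjacent limits tie in gain, and it is exactly there that bias optimality is needed to choose between them.

```python
import numpy as np

def gain(L, lam=1.0, mu=1.5, R=3.0, h=1.0):
    """Average reward of the control limit 'admit iff fewer than L in
    system': entry reward R minus holding cost h*s, averaged under the
    stationary distribution of the induced M/M/1/L queue."""
    rho = lam / mu
    pi = rho ** np.arange(L + 1)
    pi /= pi.sum()
    admit_rate = lam * (1.0 - pi[L])      # admissions per unit time
    return admit_rate * R - h * (pi @ np.arange(L + 1))

gains = {L: gain(L) for L in range(1, 12)}
best = max(gains, key=gains.get)
print(best, round(gains[best], 4))
```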


2008 ◽  
Vol 40 (2) ◽  
pp. 377-400 ◽  
Author(s):  
Savas Dayanik ◽  
Warren Powell ◽  
Kazutoshi Yamazaki

A multiarmed bandit problem is studied when the arms are not always available. The arms are first assumed to be intermittently available with some state/action-dependent probabilities. It is proven that no index policy can attain the maximum expected total discounted reward in every instance of that problem. The Whittle index policy is derived, and its properties are studied. Then it is assumed that the arms may break down, but repair is an option at some cost, and the new Whittle index policy is derived. Both problems are indexable. The proposed index policies cannot be dominated by any other index policy over all multiarmed bandit problems considered here. Whittle indices are evaluated for Bernoulli arms with unknown success probabilities.
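
Indexability is what a subsidy-based computation exploits. The sketch below is a generic (not paper-specific) way to compute a Whittle-type index for a finite-state arm: bisect on the passive subsidy W until pulling and resting are equally attractive in the target state. The availability or breakdown dynamics of the paper would enter through the assumed transition matrices P_act and P_pas.

```python
import numpy as np

def q_values(P_act, P_pas, r_act, W, beta, iters=4000):
    """Discounted value iteration for one arm: 'pull' earns r_act and
    moves by P_act; 'rest' earns the subsidy W and moves by P_pas.
    Returns the two action-value vectors at (near) convergence."""
    V = np.zeros(len(r_act))
    for _ in range(iters):
        V = np.maximum(r_act + beta * P_act @ V, W + beta * P_pas @ V)
    return r_act + beta * P_act @ V, W + beta * P_pas @ V

def whittle_index(P_act, P_pas, r_act, state, beta=0.9, lo=-10.0, hi=10.0):
    """Bisect on W until the arm is indifferent between pulling and
    resting in `state`; meaningful when the arm is indexable."""
    for _ in range(60):
        W = 0.5 * (lo + hi)
        Qa, Qp = q_values(P_act, P_pas, r_act, W, beta)
        lo, hi = (W, hi) if Qa[state] > Qp[state] else (lo, W)
    return 0.5 * (lo + hi)

# Toy two-state arm; a frozen passive arm (P_pas = I) recovers the
# classical bandit case.  All numbers are hypothetical.
P_act = np.array([[0.6, 0.4], [0.3, 0.7]])
print(whittle_index(P_act, np.eye(2), np.array([1.0, 0.0]), state=0))
```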

