Abstract

The colorfully named and much-studied multi-armed bandit is the following Markov decision problem: At epochs 1, 2, ..., a decision maker observes the current state of each of several Markov chains with rewards (bandits) and plays one of them. The Markov chains that are not played remain in their current states. The Markov chain that is played evolves for one transition according to its transition probabilities, earning an immediate reward (possibly negative) that can depend on its current state and on the state to which the transition occurs. Henceforth, to distinguish the states of the individual Markov chains from those of the Markov decision problem, the latter are called multi-states. Each multi-state prescribes a state for each of the Markov chains. This version of the multi-armed bandit problem was originally solved by John Gittins. It has a wide range of operations research applications, including resource allocation, scheduling, project management, and search. A key result for the multi-armed bandit is that attention can be restricted to a simple class of decision procedures. A label is assigned to each state of each bandit such that no two states have the same label, even if they are in different bandits. A priority rule is a policy that, given each multi-state, plays the Markov chain whose current state has the lowest label. The literature includes several different proofs of the optimality of a priority rule. Nearly all of these proofs rest on a family of optimal stopping times, one for each state of each bandit. A different approach is taken here: pairwise comparison, rather than optimal stopping, is used to demonstrate the optimality of a priority rule, for models with linear and exponential utility functions. Elementary row operations are used to identify an optimal priority rule and to compute its expected utility for a given starting state. In the case of a linear utility function, the model is generalized to include constraints that link the bandits. With C constraints, an optimal policy is shown to take the form of an initial randomization over C + 1 priority rules, and column generation is proposed as a solution method. The proposed computational methods are based on several matrix algorithms. First, an algorithm called the Triangularizer transforms the one-step reward and transition probability matrices of the individual bandits by applying elementary row operations. The transformed matrices, called finalized, are triangular: all of their entries on and below the diagonal are zero. For a given index policy, running the transformed bandits is equivalent to running the original bandits. Second, the transition probabilities and one-step rewards of the transformed bandits are used to compute the performance characteristics of index policies in polynomial time. These computations are used in the column generation algorithm for multi-armed bandits with constraints.
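
As a rough, self-contained illustration of the priority-rule idea (this sketch is not taken from the paper and does not implement the Triangularizer or the column generation method), the following Python snippet simulates one sample path of a priority rule: at every epoch it plays the bandit whose current state carries the lowest label, lets that bandit make one transition while the others stay put, and accumulates the reward. The data layout (P, R, labels) and the function name simulate_priority_rule are assumptions made for this example only.

import random

def simulate_priority_rule(P, R, labels, start, horizon, seed=0):
    """
    P[b][s]      : transition probabilities of bandit b out of state s
    R[b][s][s2]  : one-step reward when bandit b moves from state s to s2
    labels[b][s] : priority label of state s of bandit b (all labels distinct)
    start[b]     : initial state of bandit b
    horizon      : number of epochs to simulate
    Returns the total reward collected along one simulated sample path.
    """
    rng = random.Random(seed)
    state = list(start)          # current multi-state: one state per bandit
    total = 0.0
    for _ in range(horizon):
        # Play the bandit whose current state has the lowest label.
        b = min(range(len(state)), key=lambda i: labels[i][state[i]])
        s = state[b]
        # The played bandit makes one transition; the other bandits stay put.
        s2 = rng.choices(range(len(P[b][s])), weights=P[b][s])[0]
        total += R[b][s][s2]
        state[b] = s2
    return total

# Tiny illustrative instance with two 2-state bandits (made-up numbers).
P = [[[0.5, 0.5], [0.0, 1.0]],      # bandit 0
     [[0.9, 0.1], [0.2, 0.8]]]      # bandit 1
R = [[[1.0, 2.0], [0.0, 0.0]],
     [[0.5, 3.0], [0.1, 0.1]]]
labels = [[0, 3], [1, 2]]           # labels are distinct across all states
print(simulate_priority_rule(P, R, labels, start=[0, 0], horizon=20))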
