The colorfully named and much-studied multi-armed bandit is the following Markov decision problem: At epochs 1, 2, ..., a decision maker observes the current state of each of several Markov chains with rewards (bandits) and plays one of them. The Markov chains that are not played remain in their current states. The Markov chain that is played evolves for one transition according to its transition probabilities, earning an immediate reward (possibly negative) that can depend on its current state and on the state to which the transition occurs. Henceforth, to distinguish the states of the individual Markov chains from those of the Markov decision problem, the latter are called multi-states. Each multi-state prescribes a state for each of the Markov chains. This version of the multi-armed bandit problem was originally solved by John Gittins. It has a wide range of operations research applications, including resource allocation, scheduling, project management, and search.

A key result for the multi-armed bandit is that attention can be restricted to a simple class of decision procedures. A label is assigned to each state of each bandit such that no two states have the same label, even if they belong to different bandits. A priority rule is a policy that, at each multi-state, plays the Markov chain whose current state has the lowest label. The literature includes several different proofs of the optimality of a priority rule. Nearly all of these proofs rest on a family of optimal stopping times, one for each state of each bandit. A different approach is taken here. Pairwise comparison, rather than optimal stopping, is used to demonstrate the optimality of a priority rule. This is accomplished for models having linear and exponential utility functions. Elementary row operations are used to identify an optimal priority rule and to compute its expected utility for a given starting multi-state. In the case of a linear utility function, the model is generalized to include constraints that link the bandits. With C constraints, an optimal policy is shown to take the form of an initial randomization over C + 1 priority rules, and column generation is proposed as a solution method.

The proposed computational methods are based on several matrix algorithms. First, an algorithm called the Triangularizer transforms the one-step reward and transition probability matrices of the individual bandits by applying elementary row operations. The transformed matrices, called finalized, are triangular: all of their elements on and below the main diagonal are zero. For a given index policy, running the transformed bandits is equivalent to running the original bandits. Second, the transition probabilities and one-step rewards of the transformed bandits are used to compute the performance characteristics of index policies in polynomial time. These computations are used by the column generation algorithm for multi-armed bandits with constraints.
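To make the notion of a priority rule concrete, the following is a minimal sketch, not taken from the paper, of how such a rule picks a bandit at a multi-state. It assumes labels have already been assigned to every (bandit, state) pair, for example by ranking states according to their Gittins indices; all identifiers are illustrative.

    # Minimal sketch of a priority rule acting on a multi-state (hypothetical names).
    from typing import Dict, List, Tuple

    Label = Dict[Tuple[int, int], int]   # (bandit index, state) -> priority label

    def priority_rule_action(multi_state: List[int], label: Label) -> int:
        """Play the bandit whose current state carries the lowest label."""
        return min(range(len(multi_state)),
                   key=lambda b: label[(b, multi_state[b])])

    # Two bandits; bandit 0 is in state 1, bandit 1 is in state 0.
    labels: Label = {(0, 0): 2, (0, 1): 4, (1, 0): 1, (1, 1): 3}
    print(priority_rule_action([1, 0], labels))   # prints 1: label 1 beats label 4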
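For the constrained model, the abstract states that an optimal policy takes the form of an initial randomization over C + 1 priority rules. The sketch below illustrates how such a policy would operate, assuming the rules and their probabilities have already been produced (for instance, by the column generation procedure the paper proposes); the names are hypothetical.

    # Sketch of an initial randomization: one of C + 1 priority rules is drawn
    # once, at the first epoch, and is then followed forever (hypothetical names).
    import random
    from typing import Callable, List, Sequence

    PriorityRule = Callable[[List[int]], int]   # multi-state -> bandit to play

    def initial_randomization(rules: Sequence[PriorityRule],
                              probs: Sequence[float]) -> PriorityRule:
        """Draw a single priority rule according to probs and commit to it."""
        chosen = random.choices(range(len(rules)), weights=probs, k=1)[0]
        return rules[chosen]

After the initial draw, every subsequent decision is made by the single chosen rule, so the randomization only determines which of the C + 1 priority rules governs the run.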