There is an abundance of real-life situations of the following nature. Several projects, of various types, need to be done by some person, machine, or other device. These require certain amounts of time, while costs (or rewards) are associated with performing such projects, delaying them, or switching between projects. The problem is then to find a policy that determines, for given numbers of projects of the various types, on which project one should work so as to minimize (discounted or average) costs. Examples of this problem turn up in administrative settings, in manufacturing, in computer communications, etc. The problem of dynamically prioritizing projects (traffic classes) may often be formulated as a Markov decision problem, but the explosion of the state space usually makes the numerical solution of realistically sized problems prohibitive. An important class of heuristic policies, which often provide near-optimal solutions (and can sometimes even be shown to be optimal), is the class of priority index policies. A static index rule for minimizing completion costs in single-machine scheduling assigns each job the ratio of its holding cost rate to its processing time as its index: work on the job with the highest index. A dynamic index rule has been shown (Gittins and Jones (1974), as referenced in Niño-Mora (2007)) to be optimal for the multi-armed bandit problem, i.e., the sequential allocation of work to a collection of stochastic projects (bandits) so as to maximize the expected total discounted reward earned over an infinite horizon. Whittle (1988) discusses the multi-armed restless bandit problem, in which bandits may change state even while passive, under the long-run average criterion.
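To make the static index rule concrete, the following minimal Python sketch (the job data and function names are illustrative assumptions, not taken from the paper) orders jobs by the ratio of holding cost rate to processing time and evaluates the resulting total completion cost on a single machine.

    # Static index rule sketch: each job j has a holding-cost rate c and a
    # processing time p; serving jobs in decreasing order of the index c/p
    # minimizes the total holding cost accrued until completion.

    def static_index_order(jobs):
        """Return jobs sorted by the priority index c/p, highest first."""
        return sorted(jobs, key=lambda job: job["c"] / job["p"], reverse=True)

    def total_completion_cost(ordered_jobs):
        """Sum of c_j * C_j, where C_j is the completion time of job j."""
        t, cost = 0.0, 0.0
        for job in ordered_jobs:
            t += job["p"]          # job finishes at time t
            cost += job["c"] * t   # holding cost accrued until completion
        return cost

    # Illustrative data: three jobs with (holding cost rate, processing time).
    jobs = [{"c": 3.0, "p": 2.0}, {"c": 1.0, "p": 4.0}, {"c": 5.0, "p": 1.0}]
    print(total_completion_cost(static_index_order(jobs)))  # prints 21.0

The dynamic (Gittins) and restless (Whittle) indices discussed in the abstract generalize this idea to stochastic projects whose indices depend on the current state rather than on fixed job parameters.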