Abstract

SUMMARY Earlier work by the present authors has established the existence of, and a characterization of, a priority index giving the Bayes rule for the discounted multiarmed bandit problem. The calculation of this index is described and illustrated, and the results obtained are briefly discussed.

The two-armed bandit problem is so called because it models the situation faced by a gambler using a fruit machine with two arms instead of just one. When an arm is pulled the gambler either wins a prize or does not. All the prizes are of the same value, and for each arm there is a certain constant, and in general unknown, probability of success every time it is pulled, which in general differs between the two arms. The gambler's problem is to choose a sequence of pulls on the two arms, depending in a sequential manner on the record of successes and failures, in such a fashion as to maximize his expected total gains. In one version of the problem the gambler is allowed a fixed number of pulls in total. Alternatively we may consider a discounted version of the problem, for which the value of a prize received at the t-th pull, irrespective of the arms pulled on the previous (t - 1) pulls, is multiplied by a^{t-1}, for some a < 1. Multiarmed bandit problems are similar, but with more than two arms. Their chief practical motivation comes from clinical trials, though they are also of interest as probably the simplest worthwhile set of problems in the sequential design of experiments.

Recent numerical investigations of Bayesian rules for the two-armed bandit problem with independent arms, by P. W. Jones (1975) and by Wahrenberger, Antle & Klimko (1977), have shown significant improvements over other rules which have been proposed. However, the calculation of these rules is costly in terms of computer storage and time, and the reported results are all for a total number of trials not exceeding 50. The present authors (1974) showed that for the discounted case with an infinite number of trials, first considered by Bellman (1956), the Bayes rule is given by a function which, for each arm, depends on the posterior distribution for the unknown success probability. The Bayes rule is always to pull the arm for which the current value of this function is larger, and for this reason the function was termed a dynamic allocation index. Moreover, this result holds for the multiarmed bandit problem. Gittins (1979) obtained a characterization of the dynamic allocation index which in fact holds for a range of problems of which the multiarmed bandit is just one example. The present paper reports on a numerical investigation of the dynamic allocation index using this characterization, which shows that some further
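
The index policy just described can be made concrete with a short sketch. The following Python code is not taken from the paper: it is a minimal illustration, under stated assumptions, of the standard retirement (or calibration) formulation of the dynamic allocation index for a single Bernoulli arm whose unknown success probability carries a Beta(s, f) posterior. The index is approximated as the constant per-pull retirement reward at which pulling the arm and retiring are equally attractive; the function name gittins_index, the discount factor a = 0.9, the truncation depth, and the tolerance are all illustrative choices, not values from the paper.

```python
def gittins_index(s, f, a=0.9, depth=80, tol=1e-5):
    """Approximate dynamic allocation (Gittins) index for a Bernoulli
    arm with a Beta(s, f) posterior and discount factor a < 1."""
    retire_factor = 1.0 / (1.0 - a)   # value of reward lam forever

    def value(lam):
        # Depth-truncated backward induction over the lattice of
        # posterior states reachable from (s, f).  V[k] is the value
        # of the state with k extra successes after n extra pulls.
        retire = lam * retire_factor
        V = [retire] * (depth + 1)          # at the horizon, retire
        for n in range(depth - 1, -1, -1):
            newV = []
            for k in range(n + 1):
                p = (s + k) / (s + f + n)   # posterior mean success prob.
                pull = p * (1.0 + a * V[k + 1]) + (1.0 - p) * a * V[k]
                newV.append(max(retire, pull))
            V = newV
        return V[0]                         # value at the state (s, f)

    # Binary search for the retirement reward at which the gambler is
    # indifferent between pulling and retiring; with unit prizes the
    # index lies in [0, 1].
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if value(lam) > lam * retire_factor + 1e-12:
            lo = lam                        # pulling still preferred
        else:
            hi = lam
    return 0.5 * (lo + hi)


# The Bayes rule pulls the arm with the larger current index and then
# updates that arm's success/failure counts.  E.g. with uniform priors:
print(gittins_index(1, 1))   # index of a fresh arm, Beta(1, 1) prior
print(gittins_index(3, 1))   # after 2 successes and 0 failures on that arm
```

Since the pulls neglected beyond the horizon carry total weight of order a^depth / (1 - a), the truncation depth must grow as a approaches 1, which mirrors the storage and time costs of exact calculation noted above.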
