Solving two‐armed Bernoulli bandit problems using a Bayesian learning automaton

Ole‐Christoffer Granmo

doi:10.1108/17563781011049179

Abstract

PurposeThe two‐armed Bernoulli bandit (TABB) problem is a classical optimization problem where an agent sequentially pulls one of two arms attached to a gambling machine, with each pull resulting either in a reward or a penalty. The reward probabilities of each arm are unknown, and thus one must balance between exploiting existing knowledge about the arms, and obtaining new information. The purpose of this paper is to report research into a completely new family of solution schemes for the TABB problem: the Bayesian learning automaton (BLA) family.Design/methodology/approachAlthough computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. BLA avoids the problem of computational intractability by not explicitly performing the Bayesian computations. Rather, it is based upon merely counting rewards/penalties, combined with random sampling from a pair of twin Beta distributions. This is intuitively appealing since the Bayesian conjugate prior for a binomial parameter is the Beta distribution.FindingsBLA is to be proven instantaneously self‐correcting, and it converges to only pulling the optimal arm with probability as close to unity as desired. Extensive experiments demonstrate that the BLA does not rely on external learning speed/accuracy control. It also outperforms established non‐Bayesian top performers for the TABB problem. Finally, the BLA provides superior performance in a distributed application, namely, the Goore game (GG).Originality/valueThe value of this paper is threefold. First of all, the reported BLA takes advantage of the Bayesian perspective for tackling TABBs, yet avoids the computational complexity inherent in Bayesian approaches. Second, the improved performance offered by the BLA opens up for increased accuracy in a number of TABB‐related applications, such as the GG. Third, the reported results form the basis for a new avenue of research – even for cases when the reward/penalty distribution is not Bernoulli distributed. Indeed, the paper advocates the use of a Bayesian methodology, used in conjunction with the corresponding appropriate conjugate prior.

Full Text