Bernoulli multi-armed bandits are a reinforcement learning model used to optimize sequences of decisions with binary outcomes. Well-known bandit algorithms, including the optimal policy, assume that the outcomes of all previous decisions are known before the next decision is made. This assumption is often violated in real-life scenarios. As demonstrated in this article, when decision outcomes are subject to delays, the performance of existing algorithms can degrade severely. We present the first practically applicable method for computing statistically optimal decisions in the presence of outcome delays. Our method has a predictive component, abstracted out into a meta-algorithm called predictive algorithm reducing delay impact (PARDI), which significantly reduces the impact of delays on commonly used algorithms. We demonstrate empirically that the PARDI-enhanced Whittle index is nearly optimal for a wide range of Bernoulli bandit parameters and delays. Across a wide spectrum of experiments, it outperformed every other suboptimal algorithm tested, including UCB1-tuned and Thompson sampling. The PARDI-enhanced Whittle index can therefore be used when the computational requirements of the optimal policy are too high.
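The delayed-feedback setting can be illustrated with a small simulation. The sketch below is a hypothetical illustration, not the article's PARDI method: it runs standard Thompson sampling on a Bernoulli bandit in which each pull's outcome only becomes observable `delay` steps after the pull, so early decisions are made before any feedback arrives. The arm probabilities, horizon, and function name are illustrative assumptions.

```python
import random

def thompson_delayed(probs, horizon, delay, seed=0):
    """Thompson sampling on a Bernoulli bandit where each pull's
    outcome is revealed only `delay` steps later (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(probs)
    alpha = [1] * k          # Beta posterior: successes + 1 per arm
    beta = [1] * k           # Beta posterior: failures + 1 per arm
    pending = []             # queue of (reveal_time, arm, reward)
    total = 0
    for t in range(horizon):
        # Incorporate only the outcomes whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, a, r = pending.pop(0)
            alpha[a] += r
            beta[a] += 1 - r
        # Sample a mean for each arm from its posterior; pull the best.
        arm = max(range(k), key=lambda a: rng.betavariate(alpha[a], beta[a]))
        reward = 1 if rng.random() < probs[arm] else 0
        total += reward
        pending.append((t + delay, arm, reward))
    return total

# With delay = 0 this reduces to standard Thompson sampling; a positive
# delay forces the first `delay` pulls to be made from the prior alone.
no_delay = thompson_delayed((0.7, 0.4), horizon=2000, delay=0, seed=1)
delayed = thompson_delayed((0.7, 0.4), horizon=2000, delay=50, seed=1)
```

Comparing the cumulative reward of the two runs shows the kind of degradation the article quantifies: the delayed learner keeps acting on a stale posterior while its feedback is in transit.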