Abstract

Bernoulli multi-armed bandits are a reinforcement learning model used to study a variety of choice optimization problems. Often such optimizations concern a finite-time horizon. In principle, statistically optimal policies can be computed via dynamic programming, but doing so is considered infeasible due to prohibitive computational requirements and implementation complexity. Hence, suboptimal algorithms are applied in practice, despite their unknown level of suboptimality. In this article, we demonstrate that optimal policies can be efficiently computed for large time horizons or numbers of arms thanks to a novel memory organization and indexing scheme. We use optimal policies to gauge the suboptimality of several well-known finite- and infinite-time horizon algorithms, including the Whittle and Gittins indices, epsilon-greedy, Thompson sampling, and upper-confidence bound (UCB) algorithms. Our simulation study shows that all but one of the evaluated algorithms perform significantly worse than the optimal policy. The Whittle index offers a nearly optimal strategy for multi-armed Bernoulli bandits even though up to 10% of its decisions differ from the optimal policy table. Lastly, we discuss optimizations of known algorithms and derive a novel solution from UCB1-tuned that outperforms other infinite-time horizon algorithms when dealing with many arms.

Impact statement: Bernoulli bandits are a reinforcement learning model used to improve decisions with binary outcomes. They have various applications ranging from headline news selection to clinical trials. Existing bandit algorithms are suboptimal. This article provides the first practical computation method that determines the optimal decisions in Bernoulli bandits, yielding the lowest achievable decision regret (maximum expected benefit). In clinical trials, where an algorithm selects treatments for subsequent patients, our method can substantially reduce the number of unsuccessfully treated patients, by up to 5x. The optimal strategy is also used for new comprehensive evaluations of well-known suboptimal algorithms, which can significantly improve decision effectiveness in various applications.
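For readers unfamiliar with the dynamic-programming formulation mentioned above, the following is a minimal illustrative sketch, not the article's optimized memory organization or indexing scheme. It computes the expected cumulative reward of the Bayes-optimal policy for a small Bernoulli bandit by recursing over posterior success/failure counts; the uniform Beta(1,1) priors and the function names are assumptions made here for illustration.

```python
from functools import lru_cache

def optimal_bernoulli_bandit_value(n_arms: int, horizon: int) -> float:
    """Expected cumulative reward of the Bayes-optimal policy for a
    Bernoulli bandit with independent uniform (Beta(1,1)) priors,
    computed by exhaustive dynamic programming over posterior states.
    Illustrative sketch only; not the article's optimized method."""

    @lru_cache(maxsize=None)
    def value(state: tuple, remaining: int) -> float:
        # state[i] = (successes, failures) observed so far on arm i.
        if remaining == 0:
            return 0.0
        best = 0.0
        for i, (s, f) in enumerate(state):
            p = (s + 1) / (s + f + 2)           # posterior mean of arm i
            win = list(state);  win[i] = (s + 1, f)
            loss = list(state); loss[i] = (s, f + 1)
            q = p * (1.0 + value(tuple(win), remaining - 1)) \
                + (1.0 - p) * value(tuple(loss), remaining - 1)
            best = max(best, q)
        return best

    start = tuple((0, 0) for _ in range(n_arms))
    return value(start, horizon)

if __name__ == "__main__":
    # Small instance only: this naive memoized recursion grows quickly with
    # the horizon; the article's contribution is a memory organization and
    # indexing scheme that makes much larger instances tractable.
    print(optimal_bernoulli_bandit_value(n_arms=2, horizon=10))
```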

Highlights

  • Choices must be made when playing games, buying products, and when treating medical patients

  • Our modification of UCBT (UCB1-tuned) significantly improves upon the original algorithm (a sketch of the original index rule follows this list)

  • Question 3: What expected cumulative rewards are achievable with perfect play in the Bernoulli multi-armed bandit?
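The article's modified UCBT variant is defined in the full text; for reference, here is a minimal sketch of the original UCB1-tuned rule of Auer et al. (2002) specialized to Bernoulli (0/1) rewards. The arm means and horizon in the usage example are illustrative assumptions, not values from the article.

```python
import math
import random

def ucb1_tuned_index(successes: int, pulls: int, total_pulls: int) -> float:
    """UCB1-tuned index of one arm for Bernoulli (0/1) rewards,
    following Auer et al. (2002); the article's modified variant differs."""
    if pulls == 0:
        return float("inf")           # force initial exploration of every arm
    mean = successes / pulls
    variance = mean * (1.0 - mean)    # empirical variance of 0/1 rewards
    v = variance + math.sqrt(2.0 * math.log(total_pulls) / pulls)
    return mean + math.sqrt((math.log(total_pulls) / pulls) * min(0.25, v))

def play(true_means, horizon, seed=0):
    """Run UCB1-tuned on a simulated Bernoulli bandit; return total reward."""
    rng = random.Random(seed)
    k = len(true_means)
    successes, pulls, total_reward = [0] * k, [0] * k, 0
    for t in range(1, horizon + 1):
        arm = max(range(k), key=lambda j: ucb1_tuned_index(successes[j], pulls[j], t))
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
        total_reward += reward
    return total_reward

# Illustrative usage with assumed arm success probabilities.
print(play([0.4, 0.5, 0.6], horizon=1000))
```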


Introduction

Choices must be made when playing games, buying products, and when treating medical patients. The more complex the possible outcomes, the more difficult it is to find an optimal strategy. In marketing or medicine, the testing methodology used can greatly decrease the number of purchasing consumers or successfully treated patients during testing. With small testing populations, this may not be the best strategy for maximizing information gain or optimizing the outcome. At the beginning of testing (e.g., news headline testing or clinical trials), the efficacy of each option is unknown, so some test subjects will receive a more click-inducing headline or a better medical treatment than others. This is an inevitable price of knowledge acquisition. Achieving the greatest number of article reads or treating the largest number of patients is of the utmost priority.
