This article examines the development and practical applications of the multi-armed bandit algorithm in the current digital era. With the continued growth of online advertising and online learning, information has expanded explosively, making decision optimization crucial. The multi-armed bandit algorithm is a sequential decision-making model that encompasses common variants such as the greedy algorithm, the ε-greedy algorithm, the UCB algorithm, and Thompson sampling. Its main role is to strike the best balance between exploration and exploitation, addressing one of the fundamental problems in reinforcement learning. The article introduces a publicly released dataset, MovieLens, and describes in detail a series of evaluation indicators, including the average number of friends per user, the average number of listened-to artists per user, the average number of movie ratings per user, the average number of tags added by users, content diversity indicators, and statistics on differences in click-through rates of recommendations across different types of movies. In addition, the article presents the specific methods used for literature collection, screening, analysis, and review. Its purpose is to deepen understanding of the multi-armed bandit algorithm and to provide practical guidance for its future development and wide application across various fields.
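To make the exploration–exploitation trade-off concrete, the following is a minimal sketch of the ε-greedy strategy mentioned above on a simulated Bernoulli-reward bandit. The arm probabilities, ε value, and round count are illustrative assumptions, not taken from the article.

```python
import random

def epsilon_greedy_bandit(arm_probs, epsilon=0.1, rounds=10_000, seed=0):
    """Play a Bernoulli bandit with the epsilon-greedy strategy.

    arm_probs -- hypothetical true success probability of each arm
    epsilon   -- fraction of rounds spent exploring a random arm
    Returns (pull counts per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms        # how often each arm was pulled
    values = [0.0] * n_arms      # running mean reward estimate per arm
    total_reward = 0
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1 if rng.random() < arm_probs[arm] else 0
        counts[arm] += 1
        # incremental update of the running mean for the chosen arm
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, total_reward

counts, total = epsilon_greedy_bandit([0.2, 0.5, 0.7])
```

Over enough rounds the strategy concentrates its pulls on the highest-paying arm while still sampling the others occasionally; the UCB and Thompson sampling variants named above differ only in how the next arm is chosen, not in this overall loop.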