Stochastic Multi-armed Bandit Problem Research Articles

We study the stochastic multi-armed bandit problem and design new policies that enjoy both optimal regret expectation and light-tailed risk for the regret distribution. We first find that any policy that obtains the optimal instance-dependent expected regret could incur a heavy-tailed regret risk whose tail probability decays slowly with T. We then focus on policies that achieve optimal worst-case expected regret. We design a novel policy that (i) enjoys worst-case optimality for regret expectation and (ii) ensures that the worst-case probability of incurring a regret larger than any given threshold decays exponentially with respect to T. The decay rate is proved to be optimal for all worst-case optimal policies. Compared with standard confidence-bound-based policies, our proposed policy achieves a delicate balance between doing more exploration at the beginning of the time horizon and doing more exploitation when approaching the end. We also enhance the policy design to accommodate the “any-time” setting where T is unknown a priori, highlighting “lifelong exploration”, and prove performance guarantees equivalent to those in the “fixed-time” setting with known T. From a managerial perspective, we show through numerical experiments that our new policy design yields similar efficiency and better safety compared to celebrated policies. Our policy design is especially preferable when (i) there is a risk of underestimating the volatility profile, or (ii) policy hyper-parameters are difficult to tune. We conclude by extending our proposed policy design to the stochastic linear bandit setting, where it achieves both worst-case optimality in terms of regret expectation and light-tailed risk for the regret distribution.
This paper was accepted by J. George Shanthikumar, data science.
Funding: The work of D. Simchi-Levi and F. Zhu is partially supported by the MIT Data Science Laboratory.
Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.03512.
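
The distinction the abstract draws between regret expectation and the tail of the regret distribution can be observed empirically. The following is a minimal sketch, not the policy proposed in the paper: it runs the standard UCB1 policy on Bernoulli arms many times and reports both the average pseudo-regret and the fraction of runs whose regret exceeds a threshold. The function names, the Bernoulli reward model, and the example parameters are all assumptions made for illustration.

```python
import math
import random


def ucb1_pseudo_regret(means, horizon, rng):
    """Run standard UCB1 on Bernoulli arms; return cumulative pseudo-regret.

    `means` holds the true arm means (hypothetical values, known only to
    the simulator, never to the policy).
    """
    k = len(means)
    best = max(means)
    pulls = [0] * k
    totals = [0.0] * k
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initial round: pull every arm once
        else:
            # Empirical mean plus the usual UCB1 confidence bonus.
            arm = max(
                range(k),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
        regret += best - means[arm]  # gap of the arm actually pulled
    return regret


def regret_summary(means, horizon, runs, threshold, seed=0):
    """Return (mean regret, empirical P(regret > threshold)) over `runs` runs."""
    rng = random.Random(seed)
    regrets = [ucb1_pseudo_regret(means, horizon, rng) for _ in range(runs)]
    mean_regret = sum(regrets) / runs
    tail_prob = sum(r > threshold for r in regrets) / runs
    return mean_regret, tail_prob


if __name__ == "__main__":
    mean_regret, tail_prob = regret_summary(
        [0.5, 0.6], horizon=2000, runs=200, threshold=100.0
    )
    print(f"mean regret: {mean_regret:.2f}, P(regret > 100): {tail_prob:.3f}")
```

Two policies with similar mean regret can differ noticeably in the second number, which is the kind of tail risk the abstract is concerned with.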

This brief studies a variation of the stochastic multiarmed bandit (MAB) problem in which the agent has a priori knowledge of a quantity called the near-optimal mean reward (NoMR). In common MAB problems, an agent tries to find the optimal arm without knowing the optimal mean reward. In many practical applications, however, the agent can obtain an estimate of the optimal mean reward, which we define as the NoMR. For instance, in an online Web advertising system based on MAB methods, a user's near-optimal average click rate (NoMR) can be roughly estimated from his/her demographic characteristics. Exploiting the NoMR can therefore improve the algorithm's performance. First, we formalize the stochastic MAB problem with a known NoMR that lies between the suboptimal mean reward and the optimal mean reward. Second, we use the cumulative regret as the performance metric and show that the lower bound on the cumulative regret for this problem is Ω(1/∆), where ∆ is the difference between the suboptimal mean reward and the optimal mean reward. Compared with the conventional MAB problem, whose regret lower bound grows logarithmically, our regret lower bound is uniform in the learning step. Third, a novel algorithm, NoMR-BANDIT, is proposed to solve this problem. In NoMR-BANDIT, the NoMR is used to design an efficient exploration strategy. In addition, we analyze the regret upper bound of NoMR-BANDIT and conclude that it also has a uniform upper bound of O(1/∆), which is of the same order as the lower bound. Consequently, NoMR-BANDIT is an optimal algorithm for this problem. To extend the method's generality, CASCADE-BANDIT, based on NoMR-BANDIT, is proposed to solve the problem in which the NoMR is less than the suboptimal mean reward. CASCADE-BANDIT has an upper bound of O(∆ log n), where n represents the learning step, and the order O(∆ log n) is the same as that of conventional MAB methods. Finally, extensive experimental results demonstrate that NoMR-BANDIT is more efficient than the compared bandit solutions. After sufficient iterations, NoMR-BANDIT saved 10%-80% more cumulative regret than the state of the art.
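
For reference, the cumulative regret used as the performance metric above is usually written as follows. This is the standard textbook definition and decomposition, not a formula quoted from the article; the symbols \mu^{*}, \mu_i, A_t, and N_i(n) are notation assumed here, while \Delta and n follow the abstract.

```latex
% Assumed notation: K arms with means \mu_1, ..., \mu_K; \mu^{*} = \max_i \mu_i;
% A_t is the arm pulled at step t; N_i(n) counts pulls of arm i up to step n.
\[
  R_n \;=\; n\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{n} \mu_{A_t}\right]
      \;=\; \sum_{i:\,\mu_i < \mu^{*}} \Delta_i \,\mathbb{E}\!\left[N_i(n)\right],
  \qquad \Delta_i = \mu^{*} - \mu_i .
\]
% With two arms, \Delta_i reduces to the single gap \Delta of the abstract,
% so a uniform bound on R_n means the expected number of suboptimal pulls
% stays bounded as n grows.
```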

Articles published on Stochastic Multi-armed Bandit Problem

A Simple and Optimal Policy Design with Safety Against Heavy-Tailed Risk for Stochastic Bandits

The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models

Maximal Objectives in the Multiarmed Bandit with Applications

Model-Based Thompson Sampling for Frequency and Rate Selection in Underwater Acoustic Communications

Phase Transitions in Bandits with Switching Constraints

Learning Equilibria in Matching Markets with Bandit Feedback

A General Framework for Bandit Problems Beyond Cumulative Objectives

Multiplayer Bandits Without Observing Collision Information

Minimax Policy for Heavy-Tailed Bandits

DUCT: An Upper Confidence Bound Approach to Distributed Constraint Optimization Problems

Robust risk-averse multi-armed bandits with application in social engagement behavior of children with autism spectrum disorder while imitating a humanoid robot

An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward.

Non-stationary Stochastic Multi-armed Bandit Problems with External Information on Stationarity

Learning Algorithms for Minimizing Queue Length Regret

Waiting But Not Aging: Optimizing Information Freshness Under the Pull Model

Multi-Armed Bandits on Partially Revealed Unit Interval Graphs

Achieving Fairness in the Stochastic Multi-Armed Bandit Problem

Observe Before Play: Multi-Armed Bandit with Pre-Observations

Fast mmwave Beam Alignment via Correlated Bandit Learning

Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints
