Abstract
In a multi-armed bandit problem, a gambler needs to choose at each round one of K arms, each characterized by an unknown reward distribution. The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to a (static) oracle that knows the identity of the best arm a priori. This problem has been studied extensively when the reward distributions do not change over time, and uncertainty essentially amounts to identifying the optimal arm. We complement this literature by developing a flexible non-parametric model for temporal uncertainty in the rewards. The extent of temporal uncertainty is measured via the cumulative mean change in the rewards over the horizon, a metric we refer to as temporal variation, and regret is measured relative to a (dynamic) oracle that plays the point-wise optimal action at each period. Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V (a temporal uncertainty budget), we characterize the complexity of this problem via the minimax regret, which depends on V (the hardness of the problem), the horizon length T, and the number of arms K.
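As a concrete sketch of how the two central quantities can be written down (the notation below is ours, introduced only to mirror the verbal definitions in the abstract: $\mu_t^k$ denotes the mean reward of arm $k$ at round $t$, and $\pi_t$ the arm chosen by the policy):

```latex
% One plausible formalization of the quantities described above
% (illustrative notation, not quoted from the paper).
\[
  \underbrace{\mathcal{V} \;=\; \sum_{t=1}^{T-1} \max_{k}
      \bigl|\mu_{t+1}^{k} - \mu_{t}^{k}\bigr|}_{\text{temporal variation (budget } V\text{)}}
  \qquad\qquad
  \underbrace{\mathcal{R}(T) \;=\; \sum_{t=1}^{T} \max_{k} \mu_{t}^{k}
      \;-\; \mathbb{E}\Bigl[\sum_{t=1}^{T} \mu_{t}^{\pi_t}\Bigr]}_{\text{regret against the dynamic oracle}}
\]
```

The first expression caps how much the mean rewards are allowed to drift over the horizon, and the second compares the policy against the point-wise optimal action at every round rather than against a single fixed arm.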
Highlights
The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to an oracle that knows the identity of the best arm a priori
Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V, we characterize the complexity of this problem via the minimax regret, which depends on V, the horizon length T, and the number of arms K
In the prototypical multi-armed bandit (MAB) problem, a gambler needs to choose at each round of play t = 1, . . . , T one of K arms, each characterized by an unknown reward distribution
Summary
In the prototypical multi-armed bandit (MAB) problem, a gambler needs to choose at each round of play t = 1, . . . , T one of K arms, each characterized by an unknown reward distribution. An alternative and more pessimistic approach views the MAB problem as a game between the policy designer (gambler) and nature (adversary), in which the latter can change the reward distribution of the arms at every instance of play. These ideas date back to the work of Blackwell (1956) and Hannan (1957) and have since seen significant development; Foster and Vohra (1999), Cesa-Bianchi and Lugosi (2006), and Bubeck and Cesa-Bianchi (2012) provide reviews of this line of research. In this adversarial framework, regret is typically measured relative to a static oracle that plays the single best action in hindsight. This static oracle can perform quite poorly relative to a dynamic oracle that follows the dynamically optimal sequence of actions, because the latter optimizes the (expected) reward at each time instant. A potential limitation of the adversarial framework is therefore that even if a policy exhibits a "small" regret relative to the static oracle, there is no guarantee that it will perform well with respect to the more stringent dynamic oracle.
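A minimal numerical sketch of this gap (the two-arm example and all numbers are our own illustration, not taken from the paper): when the identity of the best arm changes halfway through the horizon, the static oracle's cumulative reward falls linearly behind the dynamic oracle's.

```python
# Illustrative sketch (not from the paper): why a static oracle can be much
# weaker than a dynamic oracle when mean rewards change over time.
import numpy as np

T = 10_000                  # horizon length
half = T // 2

# Two arms whose mean rewards swap halfway through the horizon.
mu = np.zeros((T, 2))
mu[:half, 0], mu[:half, 1] = 0.9, 0.1   # arm 0 is best in the first half
mu[half:, 0], mu[half:, 1] = 0.1, 0.9   # arm 1 is best in the second half

# Dynamic oracle: plays the point-wise best arm at every round.
dynamic_oracle = mu.max(axis=1).sum()

# Static oracle: plays the single arm with the highest cumulative mean reward.
static_oracle = mu.sum(axis=0).max()

print(f"dynamic oracle cumulative reward: {dynamic_oracle:.0f}")  # 0.9 * T
print(f"static oracle cumulative reward:  {static_oracle:.0f}")   # 0.5 * T
print(f"gap between the two oracles:      {dynamic_oracle - static_oracle:.0f}")
```

In this toy example the gap grows like 0.4·T, so even a policy whose regret against the static oracle is sublinear in T may still incur regret that is linear in T against the more stringent dynamic oracle.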