Abstract
In a multi-armed bandit problem, a gambler needs to choose at each round one of K arms, each characterized by an unknown reward distribution. The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to a (static) oracle that knows the identity of the best arm a priori. This problem has been studied extensively when the reward distributions do not change over time, and uncertainty essentially amounts to identifying the optimal arm. We complement this literature by developing a flexible non-parametric model for temporal uncertainty in the rewards. The extent of temporal uncertainty is measured via the cumulative mean change in the rewards over the horizon, a metric we refer to as temporal variation, and regret is measured relative to a (dynamic) oracle that plays the point-wise optimal action at each period. Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V (a temporal uncertainty budget), we characterize the complexity of this problem via the minimax regret, which depends on V (the hardness of the problem), the horizon length T, and the number of arms K.
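As a concrete sketch of how the two central quantities can be written down (the notation below is ours, introduced only to mirror the verbal definitions in the abstract: $\mu_t^k$ denotes the mean reward of arm $k$ at round $t$, and $\pi_t$ the arm chosen by the policy):

```latex
% One plausible formalization of the quantities described above
% (illustrative notation, not quoted from the paper).
\[
  \underbrace{\mathcal{V} \;=\; \sum_{t=1}^{T-1} \max_{k}
      \bigl|\mu_{t+1}^{k} - \mu_{t}^{k}\bigr|}_{\text{temporal variation (budget } V\text{)}}
  \qquad\qquad
  \underbrace{\mathcal{R}(T) \;=\; \sum_{t=1}^{T} \max_{k} \mu_{t}^{k}
      \;-\; \mathbb{E}\Bigl[\sum_{t=1}^{T} \mu_{t}^{\pi_t}\Bigr]}_{\text{regret against the dynamic oracle}}
\]
```

The first expression caps how much the mean rewards are allowed to drift over the horizon, and the second compares the policy against the point-wise optimal action at every round rather than against a single fixed arm.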
Highlights
The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to an oracle that knows the identity of the best arm a priori
Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V, we characterize the complexity of this problem via the minimax regret, which depends on V, the horizon length T, and the number of arms K
In the prototypical multi-armed bandit (MAB) problem, a gambler needs to choose at each round of play t = 1, . . . , T one of K arms, each characterized by an unknown reward distribution
Summary
In the prototypical multi-armed bandit (MAB) problem, a gambler needs to choose at each round of play t = 1, . . . , T one of K arms, each characterized by an unknown reward distribution. An alternative and more pessimistic approach views the MAB problem as a game between the policy designer (gambler) and nature (adversary), in which the latter can change the reward distribution of the arms at every instance of play. These ideas date back to the work of Blackwell (1956) and Hannan (1957) and have since seen significant development; Foster and Vohra (1999), Cesa-Bianchi and Lugosi (2006), and Bubeck and Cesa-Bianchi (2012) provide reviews of this line of research. In this adversarial framework, regret is typically measured relative to a static oracle that plays the single best action in hindsight. This static oracle can perform quite poorly relative to a dynamic oracle that follows the dynamically optimal sequence of actions, because the latter optimizes the (expected) reward at each time instant. A potential limitation of the adversarial framework is therefore that even if a policy exhibits a "small" regret relative to the static oracle, there is no guarantee that it will perform well with respect to the more stringent dynamic oracle.
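A minimal numerical sketch of this gap (the two-arm example and all numbers are our own illustration, not taken from the paper): when the identity of the best arm changes halfway through the horizon, the static oracle's cumulative reward falls linearly behind the dynamic oracle's.

```python
# Illustrative sketch (not from the paper): why a static oracle can be much
# weaker than a dynamic oracle when mean rewards change over time.
import numpy as np

T = 10_000                  # horizon length
half = T // 2

# Two arms whose mean rewards swap halfway through the horizon.
mu = np.zeros((T, 2))
mu[:half, 0], mu[:half, 1] = 0.9, 0.1   # arm 0 is best in the first half
mu[half:, 0], mu[half:, 1] = 0.1, 0.9   # arm 1 is best in the second half

# Dynamic oracle: plays the point-wise best arm at every round.
dynamic_oracle = mu.max(axis=1).sum()

# Static oracle: plays the single arm with the highest cumulative mean reward.
static_oracle = mu.sum(axis=0).max()

print(f"dynamic oracle cumulative reward: {dynamic_oracle:.0f}")  # 0.9 * T
print(f"static oracle cumulative reward:  {static_oracle:.0f}")   # 0.5 * T
print(f"gap between the two oracles:      {dynamic_oracle - static_oracle:.0f}")
```

In this toy example the gap grows like 0.4·T, so even a policy whose regret against the static oracle is sublinear in T may still incur regret that is linear in T against the more stringent dynamic oracle.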