Abstract

The multi-armed bandit is a model of exploration and exploitation in which one must select, from a finite set of arms, the one that maximizes the cumulative reward up to the time horizon T. For the adversarial multi-armed bandit problem, where the sequence of rewards is chosen by an oblivious adversary, the notion of a single best arm over the whole time horizon is too restrictive for applications such as ad serving, where the best ad may change over time. In this paper, we consider a variant of the adversarial multi-armed bandit problem in which the time horizon is divided into unknown time periods, within which rewards are drawn from stochastic distributions. During each time period there is an optimal arm, which may differ from the optimal arm of the previous period. We present an algorithm that takes advantage of the constant exploration of EXP3 to detect when the best arm changes. Its analysis shows that, on a run divided into N periods where the best arm changes, the proposed algorithm achieves a regret of O(N√(T log T)).
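
The abstract only sketches the method, but the mechanism it describes, running EXP3 and using the samples produced by its forced uniform exploration to detect that the best arm has changed, then resetting, can be illustrated. Below is a minimal Python sketch of EXP3 with such a reset rule; the detection test, the `window` and `threshold` parameters, and the `reward_fn` interface are illustrative assumptions, not the paper's exact statistic.

```python
import math
import random
from collections import deque

def exp3_with_reset(K, T, gamma, reward_fn, window=100, threshold=0.15):
    """EXP3 with a reset-on-change heuristic (illustrative sketch).

    reward_fn(arm, t) must return a reward in [0, 1]. `window` and
    `threshold` are assumed parameters, not taken from the paper.
    """
    weights = [1.0] * K
    # Recent observed rewards per arm, fed by EXP3's uniform exploration.
    history = [deque(maxlen=window) for _ in range(K)]
    total_reward = 0.0

    for t in range(T):
        s = sum(weights)
        # EXP3 mixes the exponential-weights distribution with uniform
        # exploration at rate gamma, so every arm keeps being sampled.
        probs = [(1 - gamma) * w / s + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        x = reward_fn(arm, t)
        total_reward += x
        history[arm].append(x)

        # Standard EXP3 update with the importance-weighted estimate.
        x_hat = x / probs[arm]
        weights[arm] *= math.exp(gamma * x_hat / K)
        s = sum(weights)
        weights = [w / s for w in weights]  # normalize for numerical stability

        # Change detection (assumed form): once every arm has enough recent
        # samples, if some arm's empirical mean beats the current leader's
        # by `threshold`, assume the best arm changed and restart EXP3.
        leader = max(range(K), key=lambda i: weights[i])
        if all(len(h) >= window // K for h in history):
            means = [sum(h) / len(h) for h in history]
            if max(means) - means[leader] > threshold:
                weights = [1.0] * K  # forget the previous period
                for h in history:
                    h.clear()

    return total_reward
```

A toy run on a switching bandit, where the best arm changes halfway through the horizon, shows the intended behavior; clearing the histories after a reset keeps rewards from the old period from immediately re-triggering the detector.

```python
# Two Bernoulli arms whose means swap at t = 5000.
def reward_fn(arm, t):
    p = [0.8, 0.2] if t < 5000 else [0.2, 0.8]
    return 1.0 if random.random() < p[arm] else 0.0

print(exp3_with_reset(K=2, T=10000, gamma=0.1, reward_fn=reward_fn))
```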
