Abstract

The importance of multi-armed bandit (MAB) problems is on the rise due to their recent application in a wide variety of areas such as online advertising, news article selection, wireless networks, and clinical trials, to name a few. The most common assumption made when solving such MAB problems is that the unknown reward probability θ_k of each bandit arm k is fixed. However, this assumption rarely holds in practice, simply because real-life problems often involve underlying processes that are dynamically evolving. In this paper, we model problems where the reward probabilities θ_k are drifting, and introduce a new method called Dynamic Thompson Sampling (DTS) that facilitates Order Statistics based Thompson Sampling for these dynamically evolving MABs. The DTS algorithm adapts its success probability estimates, θ̂_k, faster than traditional Thompson Sampling schemes and thus leads to improved performance in terms of lower regret. Extensive experiments demonstrate that DTS outperforms current state-of-the-art approaches, namely pure Thompson Sampling, UCB-Normal, and UCB_f, in the case of dynamic reward probabilities. Furthermore, this performance advantage increases persistently with the number of bandit arms.
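
The abstract does not spell out the DTS update rule, so the following is only a minimal sketch of how a Thompson Sampling scheme with adaptive (discounted) Beta posteriors could look in Python; the cap parameter C, the capping rule, and all function names are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def dts_step(alpha, beta, reward_fn, C=100.0):
    """One round of a DTS-style Thompson Sampling play (illustrative sketch).

    alpha, beta : per-arm Beta posterior parameters (NumPy arrays).
    reward_fn   : callable arm_index -> Bernoulli reward in {0, 1}.
    C           : assumed cap on alpha + beta, controlling how quickly
                  old evidence is forgotten (not taken from the paper).
    """
    # Thompson Sampling: draw one sample per arm, play the best-looking arm.
    samples = np.random.beta(alpha, beta)
    k = int(np.argmax(samples))
    r = reward_fn(k)

    if alpha[k] + beta[k] < C:
        # Standard conjugate Beta update while evidence is still scarce.
        alpha[k] += r
        beta[k] += 1 - r
    else:
        # Discounted update: keeps alpha + beta roughly at C, so the
        # estimate theta_hat_k = alpha_k / (alpha_k + beta_k) can keep
        # tracking a drifting theta_k instead of freezing.
        alpha[k] = (alpha[k] + r) * C / (C + 1.0)
        beta[k] = (beta[k] + 1 - r) * C / (C + 1.0)
    return k, r
```

In a simulation, each θ_k would drift over time (e.g. a bounded random walk on [0, 1]); the point of the capped update above is that the posterior mean θ̂_k stays responsive to such drift instead of converging to a stale estimate, which is the behaviour the abstract attributes to DTS.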
