Abstract

The Multi-Armed Bandit problem is a reinforcement learning problem centered on the exploration-exploitation dilemma. Given a set of options (known as arms) that can be tried repeatedly, the dilemma is how to balance gathering information by experimenting with the available arms (exploration) against maximizing profit by choosing the arm that currently appears best (exploitation). The Multi-Armed Bandit problem therefore amounts to deciding which arm to choose at every round. It has gained popularity as a more dynamic alternative to a randomized trial, since its goal is to experiment with each available arm while still maximizing the profit gained. A real-life example is determining which film artwork to show a visitor in order to attract that visitor to watch the film. A Bernoulli distribution with parameter θ is chosen to model the visitor's response after seeing the artwork. A non-stationary condition on θ can be imposed to accommodate trends in film artwork: an artwork may perform well in one month yet not be preferred the next. In this study, the non-stationary condition is modeled as piecewise-stationary. We implemented a discounted Thompson sampling policy that uses a Bayesian method to determine which arm to choose at each round, and we ran multiple simulations to empirically test the policy's performance under various conditions. Evaluation was based on cumulative regret. In these simulations, the discounted Thompson sampling policy achieved relatively lower cumulative regret under both stationary and piecewise-stationary conditions than several well-known policies, namely Epsilon Greedy, SoftMax, Upper Confidence Bound, and Thompson Sampling.
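
As a concrete illustration, the following is a minimal Python sketch of how a discounted Thompson sampling policy can be simulated on a piecewise-stationary Bernoulli bandit and scored by cumulative regret. The discount factor gamma, the arm means, the change point, and the Beta(1, 1) priors are illustrative assumptions for this sketch, not values taken from the study.

import numpy as np

rng = np.random.default_rng(0)

n_arms, horizon, gamma = 3, 2000, 0.95
# Piecewise-stationary environment: the arm means change once, halfway through.
segments = [(0, [0.2, 0.5, 0.7]), (horizon // 2, [0.7, 0.4, 0.2])]

S = np.zeros(n_arms)  # discounted success counts per arm
F = np.zeros(n_arms)  # discounted failure counts per arm
cumulative_regret = 0.0

for t in range(horizon):
    # True means of the current segment (unknown to the policy).
    means = [m for start, m in segments if t >= start][-1]

    # Thompson step: sample one index per arm from its Beta posterior, play the argmax.
    samples = rng.beta(S + 1.0, F + 1.0)
    arm = int(np.argmax(samples))
    reward = rng.binomial(1, means[arm])

    # Discount every arm's counts, then credit the played arm.
    S *= gamma
    F *= gamma
    S[arm] += reward
    F[arm] += 1 - reward

    # Cumulative regret against the best arm of the current segment.
    cumulative_regret += max(means) - means[arm]

print(f"cumulative regret after {horizon} rounds: {cumulative_regret:.1f}")

Discounting both success and failure counts on every round gradually forgets old observations, so the posterior can re-concentrate on the new best arm after a change point; this forgetting is what makes the policy suitable for piecewise-stationary conditions, whereas standard Thompson Sampling keeps accumulating evidence for the formerly best arm.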
