Analysis of Rewards in Bernoulli Bandits Using Martingales

Clement Leung, Longjun Hao

doi:10.1109/aike48582.2020.00015

Abstract

Bernoulli bandits have been found to mirror many practical situations in the context of reinforcement learning, where the aim is to maximize rewards through playing the machine over a set time frame. In an actual casino setting, it is often unrealistic to fix the time when playing stops, as the termination of play may be random and dependent on the outcomes of earlier lever pulls, which in turn affect the inclination of the gambler to continue playing. It is often assumed that exploration is repeated each time the game is played, and that the game tends to go on indefinitely. In practical situations, if the casino does not change its machines often, exploration need not be carried out repeatedly, as this would be inefficient. Moreover, gamblers are likely to stop at some point or when certain conditions are fulfilled. Here, the bandit problem is studied in terms of stopping rules that depend on earlier random outcomes and on the behavior of the players. Rewards incorporating the cost of play and the size of payouts are then calculated at the conclusion of a playing episode. The rewards for Bernoulli machines are placed within the context of martingales, which are commonly used in gambling situations, and the fairness of the game is expressed through the parameters of the bandit machines, which can manifest as various forms of martingales. The average rewards and regrets as well as episode durations are obtained under different martingale stopping times. Exploration costs and regrets for different bandit machines are analyzed. Experiments have also been undertaken which corroborate the theoretical results.
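
As a rough illustration of the setting described in the abstract, the following Python sketch simulates a single Bernoulli machine with a fixed cost per play and a fixed payout, and ends an episode when the net reward first reaches an upper or lower threshold. The function names (`drift`, `play_episode`, `average_episode`), the two-threshold stopping rule, and all parameter values are assumptions made for illustration only; they are not taken from the paper. Under these assumptions the expected net reward per play is `p * payout - cost`, so the net reward process is a martingale when this quantity is zero, a submartingale when it is positive, and a supermartingale when it is negative.

```python
import random


def drift(p, payout, cost):
    """Expected net reward of one play: win `payout` with probability p, always pay `cost`."""
    return p * payout - cost


def play_episode(p, payout, cost, upper, lower, rng, max_plays=100_000):
    """Play until the net reward leaves the open interval (lower, upper).

    Hypothetical stopping rule: the gambler quits after reaching a target
    gain `upper` or a tolerated loss `lower`. `max_plays` is a safety cap.
    Returns (net reward at stopping, number of plays).
    """
    reward = 0.0
    plays = 0
    while lower < reward < upper and plays < max_plays:
        reward -= cost                 # cost of pulling the lever
        if rng.random() < p:           # Bernoulli outcome of this pull
            reward += payout
        plays += 1
    return reward, plays


def average_episode(p, payout, cost, upper, lower, episodes, seed=0):
    """Average stopped reward and average episode duration over many episodes."""
    rng = random.Random(seed)
    total_reward = 0.0
    total_plays = 0
    for _ in range(episodes):
        reward, plays = play_episode(p, payout, cost, upper, lower, rng)
        total_reward += reward
        total_plays += plays
    return total_reward / episodes, total_plays / episodes
```

For a fair machine in this sketch (for example `p = 0.5`, `payout = 2.0`, `cost = 1.0`) with symmetric thresholds, the average stopped reward should be close to zero, whereas a machine with negative drift should tend to end episodes at the lower threshold.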

Full Text