Multi-armed bandits based on a variant of Simulated Annealing

Mohammed Shahid Abdulla,Shalabh Bhatnagar

doi:10.1007/s13226-016-0184-5

Abstract

A variant of Simulated Annealing termed Simulated Annealing with Multiplicative Weights (SAMW) has been proposed in the literature. However, convergence was dependent on a parameter β(T), which was calculated a-priori based on the total iterations T the algorithm would run for. We first show the convergence of SAMW even when a diminishing stepsize βk → 1 is used, where k is the index of iteration. Using this SAMW as a kernel, a stochastic multi-armed bandit (SMAB) algorithm called SOFTMIX can be improved to obtain the minimum-possible log regret, as compared to log2 regret of the original. Another modification of SOFTMIX is proposed which avoids the need for a parameter that is dependent on the reward distribution of the arms. Further, a variant of SOFTMIX that uses a comparison term drawn from another popular SMAB algorithm called UCB1 is then described. It is also shown why the proposed scheme is computationally more efficient over UCB1, and an alternative to this algorithm with simpler stepsizes is also proposed. Numerical simulations for all the proposed algorithms are then presented.

Full Text