Abstract
We consider the sequential resource allocation problem under the multi-armed bandit model in a non-stationary stochastic environment. Motivated by the many real applications in which information can naturally be grouped, we consider a variation of the contextual multi-armed bandit in which online clustering provides the side information. We assume a stochastic environment in which the reward of each action, conditioned on a cluster, follows a Bernoulli distribution with unknown parameters. Additionally, we assume that the nature of the problem changes over time and that the clusters drift incrementally, making the reward process non-stationary. In this setting, we propose a new algorithm based on a two-stage approach. The first stage is a sequential modification of the traditional k-means clustering algorithm, which handles a continuous data stream and acts on a subset of the data rather than a single batch. In the second stage, we incorporate the current cluster information into a Thompson Sampling policy with a discounting mechanism, both to track changes in the underlying reward and to account for potential cluster misclassification.
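The two-stage idea can be illustrated with a minimal Python sketch. This is not the authors' algorithm, only one plausible reading of the abstract: the class names, the constant centroid step size `step`, the discount factor `gamma`, the uniform Beta(1, 1) prior, and the choice to discount only the observed cluster's counts are all assumptions made for illustration.

```python
import numpy as np


class SequentialKMeans:
    """Stage 1 (sketch): online k-means that updates one centroid per point."""

    def __init__(self, init_centroids, step=0.1):
        self.centroids = np.array(init_centroids, dtype=float)
        # Assumed constant step size so centroids can follow incremental drift.
        self.step = step

    def update(self, x):
        """Assign x to its nearest centroid, move that centroid toward x."""
        x = np.asarray(x, dtype=float)
        k = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        self.centroids[k] += self.step * (x - self.centroids[k])
        return k


class DiscountedThompsonSampling:
    """Stage 2 (sketch): per-cluster Beta-Bernoulli Thompson Sampling
    with exponentially discounted success/failure counts."""

    def __init__(self, n_clusters, n_arms, gamma=0.95, seed=0):
        self.gamma = gamma  # assumed discount factor in (0, 1]
        self.successes = np.zeros((n_clusters, n_arms))
        self.failures = np.zeros((n_clusters, n_arms))
        self.rng = np.random.default_rng(seed)

    def select(self, cluster):
        """Sample one Beta draw per arm (Beta(1, 1) prior) and pick the max."""
        samples = self.rng.beta(self.successes[cluster] + 1.0,
                                self.failures[cluster] + 1.0)
        return int(np.argmax(samples))

    def update(self, cluster, arm, reward):
        """Discount all counts of this cluster, then add the new observation."""
        self.successes[cluster] *= self.gamma
        self.failures[cluster] *= self.gamma
        self.successes[cluster, arm] += reward
        self.failures[cluster, arm] += 1 - reward
```

In a round of interaction, a context `x` would first be passed to `SequentialKMeans.update` to obtain a cluster index, which is then given to `DiscountedThompsonSampling.select` to choose an arm. The observed Bernoulli reward (0 or 1) is fed back through `DiscountedThompsonSampling.update`. Because old counts decay geometrically, rewards recorded under an outdated or misassigned cluster gradually lose their influence.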