Abstract

We consider the sequential resource allocation problem under the multi-armed bandit model in a non-stationary stochastic environment. Motivated by the many real applications in which information can naturally be grouped, we consider a variation of the contextual multi-armed bandit in which online clustering provides the side information. We assume a stochastic environment in which the reward of each action, conditioned on a cluster, follows a Bernoulli distribution with unknown parameters. Additionally, we assume that the nature of the problem changes over time and that the clusters drift incrementally, making the reward process non-stationary. In this setting, we propose a new algorithm based on a two-stage approach. The first stage is a sequential modification of the traditional k-means clustering algorithm, which processes a continuous data stream and acts on a subset of the data rather than on a single batch. In the second stage, we incorporate the current cluster information into a Thompson Sampling policy with a discounting mechanism, both to track changes in the underlying reward and to account for potential cluster misclassification.
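
The two-stage idea can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper: the abstract does not specify the update rules, so the class names, the constant centroid step size `step`, the discount factor `gamma`, and the Beta(1, 1) prior are all assumptions chosen for illustration. The first class moves the nearest centroid toward each incoming point, and the second keeps discounted Beta posterior counts per (cluster, arm) pair for Bernoulli rewards.

```python
import random


class SequentialKMeans:
    """Online k-means sketch: each new point nudges its nearest centroid.

    A constant step size (an assumption, not taken from the paper) lets
    the centroids follow clusters that drift over time.
    """

    def __init__(self, centroids, step=0.1):
        self.centroids = [list(c) for c in centroids]
        self.step = step

    def assign(self, x):
        # Index of the nearest centroid by squared Euclidean distance.
        dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                 for c in self.centroids]
        return dists.index(min(dists))

    def update(self, x):
        # Move the nearest centroid a fraction `step` of the way toward x.
        k = self.assign(x)
        c = self.centroids[k]
        for i, xi in enumerate(x):
            c[i] += self.step * (xi - c[i])
        return k


class DiscountedThompsonSampling:
    """Per-cluster Thompson Sampling for Bernoulli rewards with discounting.

    Success and failure counts are multiplied by `gamma` at every update,
    so older observations fade. The exact discounting scheme is assumed.
    """

    def __init__(self, n_clusters, n_arms, gamma=0.95, rng=None):
        self.gamma = gamma
        self.rng = rng or random.Random()
        self.successes = [[0.0] * n_arms for _ in range(n_clusters)]
        self.failures = [[0.0] * n_arms for _ in range(n_clusters)]

    def select(self, cluster):
        # Draw one sample per arm from Beta(1 + S, 1 + F), play the largest.
        samples = [self.rng.betavariate(1.0 + s, 1.0 + f)
                   for s, f in zip(self.successes[cluster],
                                   self.failures[cluster])]
        return samples.index(max(samples))

    def update(self, cluster, arm, reward):
        # Discount every arm's counts in this cluster, then add the new
        # observation (reward is 0 or 1) to the arm that was played.
        s_row = self.successes[cluster]
        f_row = self.failures[cluster]
        for a in range(len(s_row)):
            s_row[a] *= self.gamma
            f_row[a] *= self.gamma
        s_row[arm] += reward
        f_row[arm] += 1 - reward
```

In a simulation loop, each round would call `SequentialKMeans.update` on the observed context to obtain a cluster index, pass that index to `DiscountedThompsonSampling.select` to choose an arm, and then feed the observed 0/1 reward back through `DiscountedThompsonSampling.update`. Setting `gamma=1.0` recovers ordinary, non-discounted Thompson Sampling within each cluster.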
