Abstract

The Multi-Armed Bandit (MAB) problem is well studied in stationary environments, where reward distributions remain fixed over time. Many real-world applications, however, are non-stationary: the reward of each arm can change. Motivated by this, we examine and compare two leading algorithms designed for such changing environments, the Sliding Window Upper Confidence Bound (SW-UCB) and the Discount Factor UCB (DF-UCB). Using both simulated and real-world datasets, we evaluate adaptability, computational efficiency, and regret minimization. Our results show that SW-UCB adapts quickly to abrupt changes, whereas DF-UCB is the more computationally efficient option under gradual drift. Both algorithms substantially outperform the conventional UCB algorithm in non-stationary settings. These findings are directly relevant to domains such as online advertising, healthcare, and finance, where the ability to adapt to dynamic environments is paramount.
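To make the comparison concrete, the sketch below illustrates the two index computations in a generic form: SW-UCB restricts its empirical means and counts to the last W plays, while DF-UCB geometrically down-weights older rewards. This is a minimal illustration, not the authors' implementation; the parameter names (window size W, discount gamma, exploration constant c) and the exact bonus term are assumptions.

```python
# Minimal sketch of sliding-window UCB vs. discounted UCB index computations.
# Parameter names and the bonus form are illustrative assumptions.
from collections import deque
import math


class SlidingWindowUCB:
    """SW-UCB: statistics are computed only over the last W plays."""

    def __init__(self, n_arms, window=200, c=2.0):
        self.n_arms, self.c = n_arms, c
        self.history = deque(maxlen=window)  # (arm, reward) pairs inside the window

    def select(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        scores = []
        for a in range(self.n_arms):
            if counts[a] == 0:
                return a  # play any arm not seen within the current window
            mean = sums[a] / counts[a]
            bonus = math.sqrt(self.c * math.log(max(len(self.history), 1)) / counts[a])
            scores.append(mean + bonus)
        return max(range(self.n_arms), key=lambda a: scores[a])

    def update(self, arm, reward):
        self.history.append((arm, reward))  # oldest sample falls out automatically


class DiscountedUCB:
    """DF-UCB / discounted UCB: old rewards are geometrically down-weighted."""

    def __init__(self, n_arms, gamma=0.99, c=2.0):
        self.n_arms, self.gamma, self.c = n_arms, gamma, c
        self.weights = [0.0] * n_arms   # discounted play counts per arm
        self.values = [0.0] * n_arms    # discounted reward sums per arm

    def select(self):
        total = sum(self.weights)
        scores = []
        for a in range(self.n_arms):
            if self.weights[a] == 0.0:
                return a  # play each arm at least once
            mean = self.values[a] / self.weights[a]
            bonus = math.sqrt(self.c * math.log(max(total, 1.0)) / self.weights[a])
            scores.append(mean + bonus)
        return max(range(self.n_arms), key=lambda a: scores[a])

    def update(self, arm, reward):
        # Discounting every arm each round lets stale estimates fade gradually.
        for a in range(self.n_arms):
            self.weights[a] *= self.gamma
            self.values[a] *= self.gamma
        self.weights[arm] += 1.0
        self.values[arm] += reward
```

The sliding window re-scans at most W samples per step and forgets abrupt changes quickly once they leave the window, whereas the discounted variant keeps only two scalars per arm, which is consistent with the abstract's observation that DF-UCB is the more resource-efficient choice under gradual drift.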
