Abstract
The Upper Confidence Bound (UCB) algorithm is a widely used approach to the Multi-Armed Bandit (MAB) problem, in which the goal is to maximize cumulative reward over time by repeatedly selecting the best action among several options. UCB balances exploration and exploitation by attaching confidence bounds to each arm's estimated reward, and these bounds guide its decision-making. In recent years, researchers have identified significant challenges in applying UCB to dynamic and contaminated environments: the underlying reward distributions may change over time, making it difficult for standard UCB to adapt, or the observed data may be polluted by noise and outliers, leading to incorrect estimates of the reward distributions. Several variants of the UCB algorithm have been developed to address these challenges; they are designed to handle the complexities of changing environments and data contamination and to deliver more robust and reliable performance in these difficult settings. This paper provides a comprehensive review of Robust-UCB (cr-UCB), Sliding Window UCB (SW-UCB), and bandit-over-bandit UCB (BOB-UCB), focusing on their theoretical foundations, practical applications, and empirical performance. By examining how these algorithms have been adapted to dynamic and contaminated environments, we find that they significantly improve adaptability in non-stationary settings and effectively reduce decision-making errors caused by data pollution, thus providing a more reliable solution to the multi-armed bandit problem.
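For readers unfamiliar with the index rule the abstract describes, the classic UCB1 strategy can be sketched as follows. This is a minimal illustration of standard UCB on an assumed toy Bernoulli bandit, not an implementation of the cr-UCB, SW-UCB, or BOB-UCB variants reviewed in the paper; the function names, arm probabilities, and exploration constant are illustrative choices.

```python
import math
import random

def ucb1(pull, n_arms, horizon, c=2.0):
    """Classic UCB1: pull each arm once, then always play the arm with the
    highest index mean_i + sqrt(c * ln(t) / n_i).
    `pull(i)` is assumed to return a reward in [0, 1] for arm i."""
    counts = [0] * n_arms     # number of pulls per arm
    sums = [0.0] * n_arms     # cumulative reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1       # initialization: try every arm once
        else:
            # Upper confidence index: empirical mean + exploration bonus
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(c * math.log(t) / counts[i]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return total_reward, counts

# Toy stationary Bernoulli bandit: arm 1 is best (success probability 0.9)
random.seed(0)
probs = [0.2, 0.9, 0.5]
reward, counts = ucb1(lambda i: 1.0 if random.random() < probs[i] else 0.0,
                      n_arms=3, horizon=2000)
```

Over 2000 rounds UCB1 concentrates most pulls on the best arm while still occasionally sampling the others; the variants surveyed here modify the mean estimate (robust estimators) or the statistics window (sliding window, bandit-over-bandit) while keeping this index structure.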