Abstract
The multi-armed bandit (MAB) problem is a classic formulation of the exploration-versus-exploitation dilemma in reinforcement learning. As an archetypal MAB problem, the stochastic multi-armed bandit (SMAB) problem is the basis of many newer MAB problems. To address the weak theoretical analysis and the single source of information used in existing SMAB methods, this paper presents "the Chosen Number of Arm with Minimal Value" (CNAMV), a method that balances exploration and exploitation adaptively. Theoretically, an upper bound on CNAMV's regret is proved, where regret is the loss incurred because the globally optimal policy is not followed at all times. Experimental results show that CNAMV yields greater reward and smaller regret, with higher efficiency, than commonly used methods such as ε-greedy, softmax, and UCB1. Therefore, CNAMV can be an effective SMAB method.
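To make the SMAB setting and the baselines concrete, the following is a minimal sketch in Python of a Bernoulli bandit simulation with two of the comparison methods named above, ε-greedy and UCB1. The abstract does not specify CNAMV's update rule, so the sketch implements only the standard baselines; the arm probabilities, horizon, and function names are illustrative assumptions, not details from the paper.

import math
import random

def run_bandit(pull, n_arms, horizon, policy):
    """Simulate a stochastic bandit: pull(a) returns a reward in [0, 1]."""
    counts = [0] * n_arms   # number of times each arm has been played
    sums = [0.0] * n_arms   # cumulative reward observed per arm
    total = 0.0
    for t in range(1, horizon + 1):
        a = policy(t, counts, sums)
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total

def eps_greedy(eps):
    """ε-greedy: explore a uniformly random arm with probability ε,
    otherwise exploit the arm with the highest empirical mean."""
    def policy(t, counts, sums):
        if random.random() < eps or 0 in counts:
            return random.randrange(len(counts))
        means = [s / c for s, c in zip(sums, counts)]
        return max(range(len(means)), key=means.__getitem__)
    return policy

def ucb1(t, counts, sums):
    """UCB1: play each arm once, then choose the arm maximizing
    empirical mean plus the sqrt(2 ln t / n_a) confidence radius."""
    for a, c in enumerate(counts):
        if c == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

# Example run: three Bernoulli arms with hypothetical success probabilities.
probs = [0.3, 0.5, 0.7]
pull = lambda a: 1.0 if random.random() < probs[a] else 0.0
print("eps-greedy reward:", run_bandit(pull, 3, 10000, eps_greedy(0.1)))
print("UCB1 reward:", run_bandit(pull, 3, 10000, ucb1))

Under this setup, regret over the horizon is the gap between the reward of always playing the best arm (here, expected reward 0.7 per pull) and the reward the policy actually collects; this is the quantity the paper bounds for CNAMV.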