Abstract

The multi-armed bandit problem is a fundamental mathematical problem in sequential optimization and reinforcement learning, with a variety of applications such as online recommendation systems and clinical trial design. It describes a situation in which a player sequentially selects from a given set of candidate choices (arms) so as to maximize the cumulative reward. In this paper, we consider non-stationary multi-armed bandit problems, in which the reward distribution of each arm varies with time. We point out that in some real applications, information about how the reward distributions change is available and can be exploited. In particular, we consider information that restricts the rounds at which the reward distributions can change. For this scenario, we propose a novel strategy called the PM policy. The proposed policy is based on the existing CUSUM-UCB and M-UCB policies, which do not use external information. Whereas these existing policies monitor all arms to detect changes in the reward distributions, our policy monitors only the important arms and rounds. As a result, the proportion of unnecessary monitoring is reduced and the search becomes more efficient. We describe a regret bound for the proposed policy and show the effectiveness of the proposed method through numerical experiments.
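To make the idea of restricting monitoring to candidate change rounds concrete, the following is a minimal Python sketch. It is not the PM policy itself, whose details are not given in this abstract. It assumes a plain UCB1 index, a simple mean-shift test on the arm just pulled, and a full restart on detection. The names `run_sketch`, `demo_reward`, `window`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math


def demo_reward(arm, t):
    """Hypothetical deterministic environment: arm 0 degrades at round 500."""
    if arm == 0:
        return 0.9 if t < 500 else 0.1
    return 0.5


def run_sketch(reward_fn, n_arms, horizon, candidate_rounds,
               window=20, threshold=0.3):
    """UCB1 with change monitoring only after known candidate rounds.

    reward_fn(arm, t) returns a reward in [0, 1] (assumed callback).
    candidate_rounds lists the only rounds at which a change may occur
    (the side information). Returns (total_reward, restart_rounds).
    """
    candidates = sorted(candidate_rounds)
    # Per-arm (round, reward) pairs observed since the last restart.
    history = [[] for _ in range(n_arms)]
    restart_round = 0
    restarts = []
    total = 0.0

    for t in range(1, horizon + 1):
        untried = [a for a in range(n_arms) if not history[a]]
        if untried:
            arm = untried[0]  # pull every arm once after each restart
        else:
            t_local = t - restart_round  # rounds since the last restart

            def index(a):
                rewards = [r for _, r in history[a]]
                mean = sum(rewards) / len(rewards)
                bonus = math.sqrt(2.0 * math.log(t_local) / len(rewards))
                return mean + bonus

            arm = max(range(n_arms), key=index)

        reward = reward_fn(arm, t)
        total += reward
        history[arm].append((t, reward))

        # Monitor only if a candidate round has passed since the last restart.
        active = [c for c in candidates if restart_round < c <= t]
        if not active:
            continue
        c = active[-1]
        pre = [r for s, r in history[arm] if s < c][-window:]
        post = [r for s, r in history[arm] if s >= c]
        if len(pre) >= window and len(post) >= window:
            shift = abs(sum(post) / len(post) - sum(pre) / len(pre))
            if shift > threshold:
                # Change detected: discard all statistics and start over.
                history = [[] for _ in range(n_arms)]
                restart_round = t
                restarts.append(t)

    return total, restarts


total, restarts = run_sketch(demo_reward, n_arms=2, horizon=1000,
                             candidate_rounds=[500])
```

With an empty `candidate_rounds` list the sketch reduces to plain UCB1 and never restarts. Because only the arm just pulled is tested, a change that improves a rarely pulled arm can go unnoticed; detectors such as M-UCB address this with forced exploration, which is omitted here for brevity.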

