Abstract
Building on the traditional multi-armed bandit paradigm with a fixed total budget, we investigate the multi-armed bandit with time-variant budgets (MAB-TV), which captures settings where the learner's actions are constrained by a random budget in each round of arm pulling. In addition, we assume that the learner pays a random cost for pulling a particular arm and obtains a random reward accordingly. The learner can pull multiple arms in each round, as long as the corresponding random budget is satisfied. To solve MAB-TV, we design the Greedy-TV algorithm. In each round, we first estimate the true average reward and average cost of each arm, and then, under the budget constraint, pull arms according to each arm's estimated ratio of average reward to average cost. We derive the regret bound of Greedy-TV with respect to the achievable sum of rewards. Evaluation results validate the superiority of Greedy-TV over three existing benchmark algorithms under varying distributions of rewards and costs and different numbers of arms.
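The abstract only outlines the per-round selection rule, so the following is a minimal Python sketch of that step under stated assumptions: plain empirical means as the reward and cost estimates, a unit-cost prior for untried arms, and a greedy fill of the round's random budget in descending reward-to-cost ratio. The class and method names (GreedyTVSketch, select_arms, update) are hypothetical; the paper's exact estimator and any confidence terms are not specified here.

```python
# A minimal sketch of one budget-constrained greedy round, assuming
# empirical-mean estimates; not the authors' exact Greedy-TV procedure.
class GreedyTVSketch:
    def __init__(self, n_arms: int):
        self.n_arms = n_arms
        self.pulls = [0] * n_arms          # pull count per arm
        self.reward_sum = [0.0] * n_arms   # cumulative observed reward
        self.cost_sum = [0.0] * n_arms     # cumulative observed cost

    def _mean_cost(self, i: int) -> float:
        # Unit-cost prior for an arm that has never been pulled (assumption).
        if self.pulls[i] == 0:
            return 1.0
        return self.cost_sum[i] / self.pulls[i]

    def _ratio(self, i: int) -> float:
        # Estimated reward-to-cost ratio; untried arms rank first so that
        # every arm is explored at least once (a common convention).
        if self.pulls[i] == 0:
            return float("inf")
        mean_reward = self.reward_sum[i] / self.pulls[i]
        return mean_reward / max(self._mean_cost(i), 1e-9)

    def select_arms(self, budget: float) -> list[int]:
        # Greedily add arms in descending estimated ratio while the
        # estimated total cost stays within this round's random budget.
        chosen, spent = [], 0.0
        for i in sorted(range(self.n_arms), key=self._ratio, reverse=True):
            if spent + self._mean_cost(i) <= budget:
                chosen.append(i)
                spent += self._mean_cost(i)
        return chosen

    def update(self, arm: int, reward: float, cost: float) -> None:
        # Record the realized random reward and cost of a pulled arm.
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward
        self.cost_sum[arm] += cost
```

In use, one would call select_arms with the observed budget of each round, pull the returned arms, and feed the realized random rewards and costs back through update so the ratio estimates improve over time.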