Abstract

Multi-armed bandits (MAB) are widely applied to optimize networking applications such as crowdsensing and mobile edge computing. In many such applications, additional feedbacks (or partial feedbacks) on some arms can be collected, but at a cost. Although a variety of algorithms have been proposed to utilize such feedbacks to speed up MAB learning, two questions remain open: (1) how to schedule the budget across the MAB learning period, and (2) which feedbacks to select from the additional feedback set. To fill this gap, we design the AF-UCB algorithm. The contribution of AF-UCB is a generic additional-feedback processing algorithm composed of an elimination module, a schedule module, and a selection module. The elimination module removes additional feedbacks that may slow down learning; it is inspired by an in-depth numerical analysis showing that some feedbacks can make two recent representative algorithms learn more slowly than using no additional feedback at all. The schedule module adaptively schedules budgets, and the selection module selects appropriate feedbacks from the additional feedback set. We derive a sub-linear regret upper bound for AF-UCB. Extensive numerical experiments validate the superior performance of AF-UCB.
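To make the setting concrete, the following is a minimal sketch of a standard UCB1 learner extended with a hook for folding in additional feedbacks. It is not the authors' AF-UCB: the `pull` and `extra_feedback` callables are hypothetical placeholders, and AF-UCB's elimination, schedule, and selection modules (which decide when to spend budget and which feedbacks to accept) are not implemented here.

```python
import math
import random

def ucb1_with_feedback(pull, n_arms, horizon, extra_feedback=None, seed=0):
    """Minimal UCB1 loop (a sketch, not the paper's AF-UCB).

    pull(arm, rng) -> reward in [0, 1] is a hypothetical environment hook.
    extra_feedback(t, rng) -> list of (arm, reward) pairs is a hypothetical
    source of costly side observations; AF-UCB's modules would decide
    when to request them and which to keep (omitted here).
    """
    rng = random.Random(seed)
    counts = [0] * n_arms   # number of observations per arm
    sums = [0.0] * n_arms   # cumulative reward per arm
    for arm in range(n_arms):  # initialize: pull each arm once
        counts[arm] += 1
        sums[arm] += pull(arm, rng)
    for t in range(n_arms, horizon):
        # UCB1 index: empirical mean plus exploration bonus
        arm = max(range(n_arms),
                  key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += pull(arm, rng)
        if extra_feedback is not None:
            # Fold any side observations into the same statistics.
            for a, r in extra_feedback(t, rng):
                counts[a] += 1
                sums[a] += r
    return counts, sums
```

For example, with two Bernoulli arms of means 0.2 and 0.8, the loop concentrates its pulls on the better arm as the horizon grows.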
