Abstract

The multi-armed bandit (MAB) problem is one of the most well-known active learning frameworks. The aim is to select the best among a set of actions by sequentially observing rewards that come from an unknown distribution. Recently, a number of distributed bandit applications have become popular over wireless networks, where agents geographically separated from a learner collect and communicate the observed rewards. In this paper we propose a compression scheme that compresses the rewards collected by the distributed agents. By providing nearly matching upper and lower bounds, we tightly characterize the number of bits needed per reward for the learner to learn accurately without suffering additional regret. In particular, we establish a generic reward quantization algorithm that can be applied on top of any (no-regret) MAB algorithm to form a new communication-efficient counterpart, which requires only a few bits (converging to as low as 3 bits as the number of iterations increases) to be sent per reward while preserving the same regret bound as with uncompressed rewards. Our lower bound is established by constructing hard instances from a subgaussian distribution. Our theory is further corroborated by numerical experiments.
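To make the setting concrete, below is a minimal illustrative sketch of how a reward quantization layer can sit on top of a standard MAB algorithm (here UCB1). It is not the paper's actual scheme: the function names, the unbiased stochastic quantizer, the fixed 3-bit budget, the bounded-Gaussian reward model, and all parameters are assumptions made for illustration only.

```python
import numpy as np

def stochastic_quantize(reward, bits=3, lo=0.0, hi=1.0):
    """Unbiased stochastic quantization of a bounded reward to `bits` bits.

    The agent transmits only the integer `index` (i.e. `bits` bits per reward);
    the learner reconstructs an unbiased estimate of the reward from it.
    """
    levels = 2 ** bits - 1
    # Clip to the assumed support and map onto the quantization grid.
    x = (np.clip(reward, lo, hi) - lo) / (hi - lo) * levels
    low = np.floor(x)
    # Randomized rounding keeps the reconstructed reward unbiased.
    return int(low + (np.random.rand() < x - low))

def dequantize(index, bits=3, lo=0.0, hi=1.0):
    """Learner-side reconstruction of the quantized reward."""
    levels = 2 ** bits - 1
    return lo + (hi - lo) * index / levels

def ucb_with_quantized_rewards(true_means, horizon=10_000, bits=3):
    """UCB1 run on quantized rewards (illustrative only)."""
    k = len(true_means)
    counts = np.zeros(k)
    means = np.zeros(k)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            ucb = means + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        # Agent observes a reward and sends only `bits` bits to the learner.
        reward = np.clip(np.random.normal(true_means[arm], 0.1), 0.0, 1.0)
        idx = stochastic_quantize(reward, bits=bits)
        r_hat = dequantize(idx, bits=bits)
        counts[arm] += 1
        means[arm] += (r_hat - means[arm]) / counts[arm]
        regret += max(true_means) - true_means[arm]
    return regret

# Example: cumulative regret of UCB1 fed with 3-bit quantized rewards.
print(ucb_with_quantized_rewards([0.5, 0.6, 0.7], bits=3))
```

In this sketch the learner never sees the raw reward, only its few-bit index, which mirrors the communication constraint described in the abstract; the paper's algorithm and its adaptive bit budget differ in the details.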
