Abstract

We study a decentralized cooperative multi-agent multi-armed bandit (MAB) problem with K arms and N agents connected over a network. In this model, each arm's reward distribution is the same for every agent, and rewards are drawn independently across agents and over time steps. At each iteration, agents independently choose an arm to play and exchange at most poly(K) real-valued messages with their neighbors. An existing lower bound on the average per-agent regret over the network shows that cooperation with other agents can reduce regret by a factor of O(\frac{1}{N}) compared to playing in isolation. Motivated by this, we study a message-passing algorithm that can be combined with existing Bayesian MAB algorithms. Using this, we propose a decentralized Thompson Sampling (TS) algorithm and a decentralized Bayes-UCB algorithm. Under decentralized TS for bounded rewards, we establish a problem-dependent upper bound on the average per-agent regret in terms of the number of agents in the network and the network topology. For Bernoulli rewards, our upper bound asymptotically matches the lower bound. We empirically show that the proposed decentralized TS algorithm incurs significantly lower per-agent regret than previously proposed algorithms. Furthermore, we show that the proposed decentralized TS can be extended to general bandit problems, where the posterior distribution cannot be computed in closed form. We combine our decentralized TS algorithm with Variational Inference to apply our proposed algorithm to complex, realistic reward distributions in a computationally efficient manner. We also implement our proposed decentralized TS under a gossip protocol and over time-varying networks, where each communication link has a fixed probability of failure, and show that it incurs logarithmic regret.
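To make the setting concrete, the sketch below simulates one plausible instantiation for Bernoulli rewards: each agent runs Thompson Sampling on its own Beta posterior and, after every round, mixes its posterior statistics with those of its neighbors. The consensus-averaging step, the mixing weights, and the name `decentralized_ts` are illustrative assumptions, not the exact message-passing rule analyzed in the paper.

```python
import numpy as np

# Illustrative sketch of decentralized Thompson Sampling for Bernoulli rewards.
# NOTE: the message-passing rule used here (averaging neighbors' Beta-posterior
# counts with row-stochastic weights) is an assumption for illustration only;
# it is not the algorithm or the regret guarantee established in the paper.

def decentralized_ts(mu, adjacency, horizon, rng=None):
    """mu: true Bernoulli means, shape (K,); adjacency: symmetric 0/1 matrix, shape (N, N)."""
    rng = np.random.default_rng() if rng is None else rng
    K, N = len(mu), adjacency.shape[0]
    # Per-agent Beta(alpha, beta) posteriors, initialized to the uniform prior Beta(1, 1).
    alpha = np.ones((N, K))
    beta = np.ones((N, K))
    best = mu.max()
    regret = np.zeros(horizon)
    # Row-stochastic mixing weights over the network (self-loops included).
    W = adjacency + np.eye(N)
    W = W / W.sum(axis=1, keepdims=True)
    for t in range(horizon):
        # Each agent samples from its posterior and plays the arm with the largest sample.
        samples = rng.beta(alpha, beta)        # shape (N, K)
        arms = samples.argmax(axis=1)          # shape (N,)
        rewards = rng.binomial(1, mu[arms])    # one Bernoulli reward per agent
        regret[t] = (best - mu[arms]).mean()   # average per-agent instantaneous regret
        # Local posterior update from each agent's own observation.
        alpha[np.arange(N), arms] += rewards
        beta[np.arange(N), arms] += 1 - rewards
        # Message passing: each agent averages posterior counts with its neighbors,
        # i.e. at most O(K) real values exchanged per round.
        alpha, beta = W @ alpha, W @ beta
    return regret.cumsum()

# Example: 4 agents on a ring network, 3 arms.
ring = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
print(decentralized_ts(np.array([0.2, 0.5, 0.7]), ring, horizon=2000)[-1])
```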
