Optimistic Whittle Index Policy: Online Learning for Restless Bandits

Kai Wang, Aparna Taneja, Milind Tambe, Lily Xu

doi:10.1609/aaai.v37i8.26207

Abstract

Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear O(H \sqrt{T \log T}) frequentist regret to solve RMABs with unknown transitions in T episodes with a constant horizon H. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed from a real-world maternal and childcare dataset.
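
The confidence-bound step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a generic Hoeffding-style interval built from transition counts, which is not necessarily the radius used in the paper. The function name `transition_confidence_bounds`, the `counts[s, a, s']` layout, and the `delta` parameter are hypothetical choices made for this illustration. The bilinear program that turns these bounds into optimistic Whittle indices is not reproduced here.

```python
import numpy as np


def transition_confidence_bounds(counts, delta=0.05):
    """Per-arm confidence bounds on transition probabilities (illustrative only).

    counts[s, a, s2] is the number of observed transitions from state s to
    state s2 under action a. The Hoeffding-style radius is an assumption of
    this sketch, not the paper's exact construction.
    """
    counts = np.asarray(counts, dtype=float)
    n_states = counts.shape[-1]
    # Number of visits to each (state, action) pair.
    n = counts.sum(axis=-1, keepdims=True)
    safe_n = np.maximum(n, 1.0)
    # Empirical estimate; fall back to uniform where (s, a) was never visited.
    p_hat = np.where(n > 0, counts / safe_n, 1.0 / n_states)
    # The radius shrinks as 1 / sqrt(n), so well-visited pairs get tight bounds.
    radius = np.sqrt(np.log(2.0 / delta) / (2.0 * safe_n))
    lower = np.clip(p_hat - radius, 0.0, 1.0)
    upper = np.clip(p_hat + radius, 0.0, 1.0)
    return p_hat, lower, upper
```

An optimistic planner would then search for transition probabilities inside `[lower, upper]` that make each arm look as valuable as possible. Pairs with few observations get wide intervals, which is what drives exploration in UCB-style methods.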
