Abstract

We consider the restless multi-armed bandit problem (RMABP) with an infinite-horizon average-cost objective. Each arm of the RMABP is associated with a Markov process that operates in two modes: active and passive. At each time slot, a controller must designate a subset of the arms to be active; the processes associated with these arms then evolve differently from the passive case. Treated as an optimal control problem, the RMABP is known to be computationally intractable to solve exactly. In many cases, the Whittle index policy achieves near-optimal performance and can be computed tractably. Nevertheless, computing the Whittle indices requires knowledge of the transition matrices of the underlying processes, which are sometimes hidden from decision makers. In this paper, we take first steps towards a tractable and efficient reinforcement learning algorithm for controlling such a system. We set up parallel Q-learning recursions, each corresponding to a candidate value of the Whittle index, and update these recursions as we control the system, learning an approximation of the Whittle index as time evolves. Tested on several examples, our controller outperforms naive priority allocations and approaches the performance of the fully informed Whittle index policy.
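
To make the idea of parallel Q-learning recursions concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm: it assumes a tabular single-arm model, a discretized grid of candidate subsidy (Whittle index) values, a constant step size, and a relative-value (average-reward) correction using a reference state. All names here (subsidy_grid, q_update, estimated_whittle_index, n_states, alpha) are hypothetical.

```python
import numpy as np

# One Q-learning recursion per candidate subsidy value (assumed discretization).
n_states = 5                                # states of a single arm (assumed)
n_actions = 2                               # 0 = passive, 1 = active
subsidy_grid = np.linspace(0.0, 1.0, 21)    # candidate Whittle index values (assumed grid)
alpha = 0.1                                 # constant learning rate (assumed)

# Q[k, s, a]: Q-value for subsidy subsidy_grid[k], state s, action a.
Q = np.zeros((len(subsidy_grid), n_states, n_actions))

def q_update(k, s, a, reward, s_next):
    """Relative (average-reward) Q-learning update for the k-th subsidy recursion.

    The passive action earns the candidate subsidy, following the Whittle
    relaxation; subtracting the value of a reference state is one common
    convention for the average-cost setting (an assumption here).
    """
    subsidized_reward = reward + (subsidy_grid[k] if a == 0 else 0.0)
    target = subsidized_reward + Q[k, s_next].max() - Q[k, 0].max()
    Q[k, s, a] += alpha * (target - Q[k, s, a])

def estimated_whittle_index(s):
    """Estimate the Whittle index of state s as the candidate subsidy at which
    the learner is closest to indifferent between active and passive actions."""
    gaps = np.abs(Q[:, s, 1] - Q[:, s, 0])
    return subsidy_grid[int(np.argmin(gaps))]
```

In use, every observed transition of an arm would feed q_update for each (or a sampled subset) of the subsidy recursions, and the controller would activate the arms whose states currently have the largest estimated_whittle_index, mimicking the Whittle index policy as the estimates converge.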
