In this paper, we investigate Thompson sampling-based sequential block elimination approaches for dynamic assignment problems in a pure-exploration Multi-Armed Bandit (MAB) setting with a limited budget. The problem can be viewed as a bandit game played between the environment and a decision-maker in a metric space. Many problems in fields such as e-commerce, logistics, mobility management, data management, and operations research can be framed as dynamic assignment problems with budget constraints. Given an l-dimensional action space representing l variants of an entity and a budget for exploring the action space, the optimal dynamic assignment problem is the task of identifying the values to be assigned to the different variants of the entity that maximize the total reward, using at most the given budget of rounds of play. We contribute a class of block elimination-based MAB algorithms specifically designed for the dynamic assignment problem with a limited budget. Our algorithms begin by discretizing the continuous action space into a finite set of discrete actions, then proceed with a recursive block elimination procedure that removes sub-optimal actions. The elimination is carried out by computing confidence bounds over blocks of actions, and we explore two different confidence bound estimation techniques. We perform comprehensive experiments on two problem instances from distributed data management and logistics. Our results show that our approach yields a lower misidentification probability (i.e., the probability of recommending a non-optimal action) than state-of-the-art elimination-based pure-exploration MAB algorithms.
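To make the discretize-then-eliminate pipeline concrete, the following is a minimal illustrative sketch, not the authors' algorithm: it assumes a one-dimensional action interval, Gaussian rewards, a Gaussian Thompson posterior, and a Hoeffding-style confidence radius; the function name block_elimination and all of its parameters are hypothetical.

```python
import numpy as np

def block_elimination(reward_fn, low, high, budget,
                      n_arms=32, block_size=4, sigma=1.0, seed=0):
    """Sketch: Thompson-sampling block elimination on a 1-D action space.

    reward_fn : stochastic reward oracle for a continuous action
    low, high : bounds of the continuous action interval
    budget    : total number of pulls allowed
    """
    rng = np.random.default_rng(seed)
    # Step 1: discretize the continuous action space into n_arms arms.
    arms = np.linspace(low, high, n_arms)
    active = np.arange(n_arms)   # indices of surviving arms
    pulls = np.zeros(n_arms)     # pull counts per arm
    means = np.zeros(n_arms)     # empirical mean rewards

    spent = 0
    while spent < budget and len(active) > block_size:
        # Thompson step: sample a plausible mean for each active arm
        # from a Gaussian posterior and pull the sampled best arm.
        post_sd = sigma / np.sqrt(np.maximum(pulls[active], 1))
        samples = rng.normal(means[active], post_sd)
        a = active[np.argmax(samples)]
        r = reward_fn(arms[a])
        pulls[a] += 1
        means[a] += (r - means[a]) / pulls[a]
        spent += 1

        # Elimination step: periodically drop a block of arms whose
        # upper confidence bound is dominated by the best lower
        # confidence bound (Hoeffding-style radius; an assumption).
        if spent % n_arms == 0:
            rad = sigma * np.sqrt(2 * np.log(budget) /
                                  np.maximum(pulls[active], 1))
            lcb_best = np.max(means[active] - rad)
            order = np.argsort(means[active] + rad)  # worst UCB first
            drop = [i for i in order
                    if means[active[i]] + rad[i] < lcb_best][:block_size]
            if drop:
                active = np.delete(active, drop)

    # Recommend the surviving arm with the highest empirical mean.
    best = active[np.argmax(means[active])]
    return arms[best]
```

Under these assumptions the procedure can be exercised with a synthetic oracle, e.g. `block_elimination(lambda x: -abs(x - 0.7) + np.random.normal(0, 0.1), low=0.0, high=1.0, budget=5000)`, which should recommend an action near 0.7; the paper's actual confidence bound constructions and block schedule may differ.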