Abstract
Bernoulli multi-armed bandits are a reinforcement learning model used to optimize sequences of decisions with binary outcomes. Well-known bandit algorithms, including the optimal policy, assume that the outcomes of previous decisions are known before a decision is made. This assumption is often not satisfied in real-life scenarios. As demonstrated in this article, if decision outcomes are affected by delays, the performance of existing algorithms can be severely degraded. We present the first practically applicable method to compute statistically optimal decisions in the presence of outcome delays. Our method has a predictive component abstracted out into a meta-algorithm, predictive algorithm reducing delay impact (PARDI), which significantly reduces the impact of delays on commonly used algorithms. We demonstrate empirically that the PARDI-enhanced Whittle index is nearly optimal for a wide range of Bernoulli bandit parameters and delays. Across a wide spectrum of experiments, it performed better than every other suboptimal algorithm, e.g., UCB1-tuned and Thompson sampling. The PARDI-enhanced Whittle index can be used when the computational requirements of the optimal policy are too high.
Highlights
Making choices is an integral part of everyday life
We present the first method of determining the optimal strategy for this type of situation, together with a meta-algorithm, Predictive Algorithm Reducing Delay Impact (PARDI), that drastically improves the quality of decisions made by well-known algorithms, lowering regret by up to 3x
We show that the optimal policy indexing scheme presented in [16] makes such computations practically feasible for 2-arm and 3-arm bandits
Summary
Making choices is an integral part of everyday life. In most situations the outcome of a decision is uncertain. A probabilistic model used in reinforcement learning to capture such situations is the multi-armed bandit [1]. It assumes that decisions are sequential and that at each time step one of a finite number of options (arms) is selected. Each arm, when pulled, produces a random reward according to its own probability distribution, which is unknown to the player. Fundamental bandit algorithms were developed under the assumption that rewards are immediate, i.e., known to the algorithm at the time of the subsequent decision. This can be a serious limitation in practical applications [4], [14], and the issue is receiving increasing attention in research [17]. All applications discussed earlier can operate in environments where rewards are subject to delay.
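To make the effect of delays concrete, the following sketch (not the paper's PARDI method; all names here are illustrative) simulates a Bernoulli bandit in which each pull's reward becomes observable only a fixed number of steps after the pull, and a Thompson-sampling player that can update its posteriors only from rewards that have already arrived.

```python
import random

def delayed_thompson(probs, horizon, delay, seed=0):
    """Thompson sampling on a Bernoulli bandit whose rewards arrive
    `delay` steps after the pull (delay=0 is the classical setting)."""
    rng = random.Random(seed)
    k = len(probs)
    alpha = [1] * k   # Beta posterior: 1 + observed successes per arm
    beta = [1] * k    # Beta posterior: 1 + observed failures per arm
    pending = []      # (arrival_time, arm, reward) not yet observed
    total = 0
    for t in range(horizon):
        # Incorporate only the rewards whose delay has elapsed.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, arm, r in arrived:
            alpha[arm] += r
            beta[arm] += 1 - r
        # Sample a plausible mean for each arm; pull the best sample.
        arm = max(range(k), key=lambda a: rng.betavariate(alpha[a], beta[a]))
        r = 1 if rng.random() < probs[arm] else 0
        total += r
        pending.append((t + delay, arm, r))
    return total
```

Comparing `delayed_thompson([0.9, 0.1], 200, 0)` with `delayed_thompson([0.9, 0.1], 200, 50)` illustrates the mechanism the article studies: under delay the player keeps choosing from stale posteriors, which is exactly the gap a predictive component such as PARDI aims to close.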