Abstract

Bernoulli multi-armed bandits are a reinforcement learning model used to optimize sequences of decisions with binary outcomes. Well-known bandit algorithms, including the optimal policy, assume that the outcomes of previous decisions are known before the next decision is made. This assumption is often not satisfied in real-life scenarios. As demonstrated in this article, if decision outcomes are affected by delays, the performance of existing algorithms can be severely degraded. We present the first practically applicable method to compute statistically optimal decisions in the presence of outcome delays. Our method has a predictive component, abstracted out into a meta-algorithm, the predictive algorithm reducing delay impact (PARDI), which significantly reduces the impact of delays on commonly used algorithms. We demonstrate empirically that the PARDI-enhanced Whittle index is nearly optimal for a wide range of Bernoulli bandit parameters and delays. In a wide spectrum of experiments, it performed better than every other suboptimal algorithm tested, e.g., UCB1-tuned and Thompson sampling. The PARDI-enhanced Whittle index can therefore be used when the computational requirements of the optimal policy are too high.

Highlights

  • Making choices is an integral part of everyday life

  • We present the first method of determining the optimal strategy for this type of situation, together with a meta-algorithm, Predictive Algorithm Reducing Delay Impact (PARDI), that drastically improves the quality of decisions made by well-known algorithms, lowering regret by up to 3x (a sketch of the idea follows this list)

  • We show that the optimal policy indexing scheme presented in [16] makes such computations practically feasible for 2-arm and 3-arm bandits
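This summary does not spell out PARDI's predictive step, so the snippet below is only a minimal Python sketch of the general idea suggested by the abstract: before delegating the arm choice to a base bandit algorithm (here Thompson sampling), the still-unobserved outcomes of pending, delayed pulls are imputed using each arm's estimated success probability. The function names (`pardi_like_select`, `thompson_select`) and the imputation rule are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_select(successes, failures):
    """Base policy: Thompson sampling over Beta posteriors (Beta(1, 1) priors)."""
    samples = rng.beta(successes + 1.0, failures + 1.0)
    return int(np.argmax(samples))

def pardi_like_select(successes, failures, pending_pulls):
    """Illustrative PARDI-style step (an assumption, not the paper's exact method):
    impute the delayed, still-unobserved outcomes of pending pulls with the arm's
    posterior-mean success probability, then run the base policy on the augmented
    counts."""
    s = successes.astype(float).copy()
    f = failures.astype(float).copy()
    for arm, n_pending in enumerate(pending_pulls):
        p_hat = (s[arm] + 1.0) / (s[arm] + f[arm] + 2.0)  # posterior mean of arm
        s[arm] += n_pending * p_hat                        # expected successes in flight
        f[arm] += n_pending * (1.0 - p_hat)                # expected failures in flight
    return thompson_select(s, f)

# Example: a 2-arm bandit where 3 pulls of arm 1 have not been observed yet.
successes = np.array([4, 7])
failures  = np.array([6, 3])
pending   = np.array([0, 3])
print(pardi_like_select(successes, failures, pending))
```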

Summary

INTRODUCTION

Making choices is an integral part of everyday life. In most situations the outcome of a decision is uncertain. A probabilistic model used in reinforcement learning for such situations is the multi-armed bandit [1]. It assumes that decisions are sequential and that at each step one of a finite number of options (arms) is selected. Each arm, when pulled, produces a random reward according to its own probability distribution, which is unknown to the player. Fundamental bandit algorithms were developed under the assumption that the rewards are immediate, i.e., known to the algorithm at the time of the subsequent decision. This may be a serious limitation in practical applications [4], [14], and it is receiving increasing attention in research [17]. All of the applications discussed earlier can operate in environments where rewards are subject to delay.
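To make the setting concrete, here is a small, self-contained Python sketch (not taken from the paper) of a Bernoulli bandit in which each pull's binary outcome reaches the player only after a fixed delay. Thompson sampling stands in as the player, and the arm probabilities, horizon, and delay are illustrative values.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

def run_delayed_bernoulli_bandit(p_arms=(0.4, 0.6), horizon=1000, delay=20):
    """Simulate a Bernoulli bandit whose rewards become observable only
    `delay` steps after the corresponding pull (all values illustrative)."""
    k = len(p_arms)
    successes = np.zeros(k)
    failures = np.zeros(k)
    in_flight = deque()          # (arrival_time, arm, reward) not yet observed
    total_reward = 0

    for t in range(horizon):
        # Deliver the outcomes whose delay has elapsed.
        while in_flight and in_flight[0][0] <= t:
            _, arm, r = in_flight.popleft()
            successes[arm] += r
            failures[arm] += 1 - r

        # Thompson sampling using only the information observed so far.
        samples = rng.beta(successes + 1.0, failures + 1.0)
        arm = int(np.argmax(samples))

        # The reward is generated now but revealed to the player only later.
        r = int(rng.binomial(1, p_arms[arm]))
        total_reward += r
        in_flight.append((t + delay, arm, r))

    return total_reward

print(run_delayed_bernoulli_bandit())
```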

Related Work and Motivation
Objectives
Definitions and Notation
Suboptimal Algorithms
DELAY IMPACT
OPTIMAL POLICY UNDER DELAY
Unknown-Rewards - Outcome Analysis
Expression
Computation
Empirical Evaluation
META-ALGORITHM PARDI
PARDI: EMPIRICAL EVALUATION
Findings
SUMMARY AND CONCLUSIONS