Abstract
Bernoulli multi-armed bandits are a reinforcement learning model used to optimize sequences of decisions with binary outcomes. Well-known bandit algorithms, including the optimal policy, assume that the outcomes of previous decisions are known before a decision is made. This assumption is often not satisfied in real-life scenarios. As demonstrated in this article, if decision outcomes are affected by delays, the performance of existing algorithms can be severely degraded. We present the first practically applicable method to compute statistically optimal decisions in the presence of outcome delays. Our method has a predictive component abstracted out into a meta-algorithm, predictive algorithm reducing delay impact (PARDI), which significantly reduces the impact of delays on commonly used algorithms. We demonstrate empirically that the PARDI-enhanced Whittle index is nearly optimal for a wide range of Bernoulli bandit parameters and delays. Across a wide spectrum of experiments, it performed better than every other suboptimal algorithm, e.g., UCB1-tuned and Thompson sampling. The PARDI-enhanced Whittle index can be used when the computational requirements of the optimal policy are too high.
Highlights
Making choices is an integral part of everyday life
We present the first method of determining the optimal strategy for this type of situation, together with a meta-algorithm, Predictive Algorithm Reducing Delay Impact (PARDI), that drastically improves the quality of decisions made by well-known algorithms, lowering regret by up to 3x
We show that the optimal policy indexing scheme presented in [16] makes such computations practically feasible for 2-arm and 3-arm bandits
Summary
Making choices is an integral part of everyday life. In most situations the outcome of a decision is uncertain. A probabilistic model used in reinforcement learning to capture such situations is the multi-armed bandit [1]. It assumes that decisions are sequential and that at each time step one of a finite number of options (arms) is selected. Each arm, when pulled, produces a random reward according to its own probability distribution, which is unknown to the player. Fundamental bandit algorithms were developed under the assumption that rewards are immediate, i.e., known to the algorithm at the time of the subsequent decision. This can be a serious limitation in practical applications [4], [14], and the issue is receiving increasing attention in research [17]. All applications discussed earlier can operate in environments where rewards are subject to delay.
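To make the effect of delays concrete, the following sketch (not the paper's PARDI method; all names here are illustrative) simulates a Bernoulli bandit in which each pull's reward becomes observable only a fixed number of steps after the pull, and a Thompson-sampling player that can update its posteriors only from rewards that have already arrived.

```python
import random

def delayed_thompson(probs, horizon, delay, seed=0):
    """Thompson sampling on a Bernoulli bandit whose rewards arrive
    `delay` steps after the pull (delay=0 is the classical setting)."""
    rng = random.Random(seed)
    k = len(probs)
    alpha = [1] * k   # Beta posterior: 1 + observed successes per arm
    beta = [1] * k    # Beta posterior: 1 + observed failures per arm
    pending = []      # (arrival_time, arm, reward) not yet observed
    total = 0
    for t in range(horizon):
        # Incorporate only the rewards whose delay has elapsed.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, arm, r in arrived:
            alpha[arm] += r
            beta[arm] += 1 - r
        # Sample a plausible mean for each arm; pull the best sample.
        arm = max(range(k), key=lambda a: rng.betavariate(alpha[a], beta[a]))
        r = 1 if rng.random() < probs[arm] else 0
        total += r
        pending.append((t + delay, arm, r))
    return total
```

Comparing `delayed_thompson([0.9, 0.1], 200, 0)` with `delayed_thompson([0.9, 0.1], 200, 50)` illustrates the mechanism the article studies: under delay the player keeps choosing from stale posteriors, which is exactly the gap a predictive component such as PARDI aims to close.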