Abstract

In many sequential decision-making settings with uncertainty about the reward of each action, frequently selecting a specific action can reduce its expected reward, while choosing less frequently selected actions can increase it. These effects are commonly observed in settings ranging from personalized healthcare interventions to targeted online advertising. To address this problem, the authors propose a new class of models called ROGUE (reducing or gaining unknown efficacy) multiarmed bandits. The authors present a maximum likelihood approach to estimate the parameters of these models and show that these estimates can be used to construct upper confidence bound and epsilon-greedy algorithms for optimizing these models with strong theoretical guarantees. They conclude with a simulation study showing that these algorithms outperform current nonstationary bandit algorithms in terms of both cumulative regret and average reward.
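
To make the setting concrete, below is a minimal epsilon-greedy sketch on a toy bandit whose arm rewards fall with use and recover with rest. The dynamics, parameter values, and the plain sample-mean tracker are illustrative assumptions, not the paper's ROGUE model or its maximum likelihood estimator.

```python
import numpy as np

# Toy nonstationary bandit: each arm has a latent efficacy state that
# decays when the arm is played and recovers toward 1 when it rests.
# All dynamics and constants here are assumptions for illustration.
rng = np.random.default_rng(0)
n_arms, horizon, eps = 3, 1000, 0.1
base = np.array([1.0, 0.8, 0.6])   # hypothetical per-arm base rewards
decay, recover = 0.9, 0.05         # hypothetical state dynamics
state = np.ones(n_arms)            # latent efficacy states in [0, 1]
counts = np.zeros(n_arms)
means = np.zeros(n_arms)

for t in range(horizon):
    # epsilon-greedy: explore uniformly with prob eps, else exploit
    if rng.random() < eps:
        a = int(rng.integers(n_arms))
    else:
        a = int(np.argmax(means))
    reward = base[a] * state[a] + 0.1 * rng.standard_normal()
    state += recover * (1.0 - state)  # resting arms drift back toward full efficacy
    state[a] *= decay                 # the played arm loses efficacy
    counts[a] += 1
    means[a] += (reward - means[a]) / counts[a]  # running sample mean
```

A naive sample-mean tracker like this lags behind the shifting rewards; the paper's algorithms instead estimate the reward dynamics themselves via maximum likelihood, which is what enables their regret guarantees.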
