Abstract

We propose a theoretical and computational framework for approximating the optimal policy in multi-armed bandit problems where the reward distributions are non-Gaussian. We first construct a probabilistic interpolation of the sequence of discrete-time rewards in the form of a continuous-time conditional Lévy process. In the Gaussian setting, this approach allows an easy connection to Brownian motion and its convenient time-change properties. No such device is available for non-Gaussian rewards; however, we show how optimal stopping theory can be used to characterize the value of the optimal policy, using a free-boundary partial integro-differential equation, for exponential and Poisson rewards. We then solve this problem numerically to approximate the set of belief states possessing a given optimal index value, and provide illustrations showing that the solution behaves as expected.
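The index characterization described above has a classical discrete-time counterpart that can serve as a sanity check for the Poisson case. The sketch below is not the paper's free-boundary PIDE method: it approximates the Gittins index of a single Poisson-reward arm under a conjugate Gamma(a, b) prior by backward induction on Whittle's retirement formulation, bisecting on the retirement reward. The discount factor, horizon and count truncations, and the bisection bracket (GAMMA, HORIZON, K_MAX) are all illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.special import gammaln

# All names and constants below are illustrative assumptions, not from the paper.
GAMMA = 0.9    # geometric discount factor
HORIZON = 25   # truncation depth for backward induction
K_MAX = 15     # per-step cap on the Poisson count (tail truncated)

def predictive_pmf(a, b):
    """Posterior-predictive pmf of the next count when the Poisson rate
    has a Gamma(a, b) posterior: a negative binomial, truncated to 0..K_MAX."""
    k = np.arange(K_MAX + 1)
    logp = (gammaln(k + a) - gammaln(a) - gammaln(k + 1)
            + a * np.log(b / (b + 1.0)) - k * np.log(b + 1.0))
    p = np.exp(logp)
    return p / p.sum()  # renormalize the truncated tail

def value_with_retirement(a0, b0, m):
    """Value of the one-armed 'retirement' problem: at each step either
    retire (collect m forever, worth m / (1 - GAMMA)) or pull the Poisson
    arm once more. Beliefs reachable after n pulls with s total counts
    are Gamma(a0 + s, b0 + n), so backward induction runs over (n, s)."""
    retire = m / (1.0 - GAMMA)
    V_next = np.full(HORIZON * K_MAX + 1, retire)  # forced retirement at horizon
    for n in range(HORIZON - 1, -1, -1):
        V = np.empty(n * K_MAX + 1)
        for s in range(n * K_MAX + 1):
            a, b = a0 + s, b0 + n
            pk = predictive_pmf(a, b)
            # a / b is the posterior-predictive mean of the next count
            cont = a / b + GAMMA * pk @ V_next[s : s + K_MAX + 1]
            V[s] = max(retire, cont)
        V_next = V
    return V_next[0]

def gittins_index(a0, b0, tol=1e-4):
    """Bisect on the retirement reward m: the index is the m at which
    retiring immediately first becomes optimal."""
    lo, hi = 0.0, 10.0 * a0 / b0  # assumed bracket for the index
    while hi - lo > tol:
        m = 0.5 * (lo + hi)
        if value_with_retirement(a0, b0, m) > m / (1.0 - GAMMA) + 1e-9:
            lo = m  # continuing strictly beats retiring, so the index exceeds m
        else:
            hi = m
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Prior Gamma(2, 1): prior mean rate 2.0; the index should exceed it,
    # reflecting the value of further learning about the arm.
    print(gittins_index(2.0, 1.0))
```

Truncating the horizon and the count support introduces discretization error (geometric in HORIZON, plus the discarded tail mass); a longer horizon or larger cap tightens the approximation at the usual dynamic-programming cost, which is one motivation for the continuous-time characterization studied in the paper.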
