Abstract

We study dynamic allocation problems for discrete-time multi-armed bandits under uncertainty, based on the theory of nonlinear expectations. We show that, under an independence assumption on the bandits and with some relaxation in the definition of optimality, a Gittins allocation index gives optimal choices. This involves studying the interaction of our uncertainty with controls which determine the filtration. We also present a simple numerical example which illustrates the interaction between the agent's willingness to explore and their uncertainty aversion when making decisions.
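For context, in the classical (non-robust) setting the Gittins index of an arm with state process X, reward function R and discount factor β ∈ (0,1) admits the standard optimal-stopping characterization below; this is the textbook definition, not the paper's robust variant, and the notation is illustrative rather than the paper's own:

    \nu(x) = \sup_{\tau \geq 1}
      \frac{\mathbb{E}\left[ \sum_{t=0}^{\tau-1} \beta^t R(X_t) \,\middle|\, X_0 = x \right]}
           {\mathbb{E}\left[ \sum_{t=0}^{\tau-1} \beta^t \,\middle|\, X_0 = x \right]},

where the supremum ranges over stopping times τ of the arm's own filtration. The classical Gittins theorem states that always playing an arm with maximal index is optimal for the discounted multi-armed bandit problem; the paper's contribution concerns what survives of this result when the expectation above is replaced by a nonlinear one.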

Highlights

  • When making decisions, people generally have a strict preference for options which they understand well

  • Since the classical work of Knight [52] and Keynes [50], there has been a stream of thinking within economics and statistics that focuses on the difference between the randomness of an outcome and lack of knowledge of its probability distribution

  • We have discussed in the previous section that the robust Gittins index (2.4) in the sense of Caro and Gupta [20] is not optimal, as it does not lead to a solution of the robust Bellman equation (a generic form of which is sketched after this list)
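For orientation, a generic robust Bellman equation of the kind referenced in the last highlight can be written as follows; the symbols are illustrative and not taken from the paper:

    V(x) = \sup_{a \in A} \, \inf_{\mathbb{Q} \in \mathcal{Q}}
           \mathbb{E}^{\mathbb{Q}} \Big[ r(x,a) + \beta \, V(X_1) \,\Big|\, X_0 = x \Big],

where A is the action set (here, the choice of arm), \mathcal{Q} is a set of plausible models capturing Knightian uncertainty, r is the reward and β ∈ (0,1) a discount factor. A candidate index is optimal only if the induced index policy attains this robust value, which is the criterion the quoted highlight says the Caro–Gupta index fails.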


Summary

Introduction

People generally have a strict preference for options which they understand well. In order to address this issue, while accounting for uncertainty, we discuss an alternative approach to deriving a time-consistent control problem, based on ideas from indifference pricing and the martingale optimality principle. Using this approach, we show that when comparing different independent options, we can calculate an index separately for each alternative such that the ‘optimal’ strategy is always to choose the option with the smallest index. This idea was initially proposed by Gittins and Jones [41] (see [42, 40]) in a context where the probability measure is fixed but estimation (in a Bayesian perspective) is modeled by the evolution of a Markov process. We shall see that our algorithm gives behaviour which is both optimistic and pessimistic in different regimes, and compares well with existing methods for multi-armed bandits.
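To make the index-policy structure concrete, here is a minimal Python sketch under toy assumptions: each arm carries only its own sampling state, a scalar index is computed arm by arm, and the decision rule picks the arm with the smallest index, matching the convention above. The function toy_index is a hypothetical stand-in; the paper's robust Gittins index is instead defined through nonlinear expectations and an optimal-stopping problem.

    import random

    def toy_index(pulls, total_cost):
        # Hypothetical stand-in for a Gittins-type index on one arm: the
        # smoothed empirical mean cost minus an exploration bonus that
        # shrinks as the arm is sampled, so that *smaller* is better,
        # matching the smallest-index convention quoted above.
        mean_cost = (total_cost + 0.5) / (pulls + 1)
        bonus = (1.0 / (pulls + 1)) ** 0.5
        return mean_cost - bonus

    def choose_arm(states):
        # Structural point of an index policy: each arm's index depends
        # only on that arm's own state, and the decision between arms
        # reduces to comparing scalars.
        return min(range(len(states)), key=lambda a: toy_index(*states[a]))

    # Usage sketch: three arms with unknown (hypothetical) Bernoulli costs.
    random.seed(0)
    truth = [0.3, 0.5, 0.7]
    states = [(0, 0.0) for _ in truth]   # (pulls, total_cost) per arm
    for _ in range(200):
        a = choose_arm(states)
        cost = 1.0 if random.random() < truth[a] else 0.0
        pulls, total = states[a]
        states[a] = (pulls + 1, total + cost)

The separability is the point: nothing in choose_arm couples the arms, which is what makes index policies tractable and is also what the paper's independence assumption on the bandits is needed to preserve under uncertainty.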

Multi-armed bandits
General problem formulation
Classical Gittins theorem
Robust Gittins index
Uncertainty on multiple bandits
Optimality
C-Optimality
Endowment effect
Overview of bandits under uncertainty
Robust Gittins theorem
Sketch of the Proof
Information structures for multi-armed bandits
Multi-armed bandit optimality
Numerical results
Prospect theory
Monte-Carlo simulation
Measures of regret
Policy for multi-armed bandits
Robustness of the DR algorithms
A Part A: analysis of a single bandit
B Part B: analysis of multiple bandits
C Proof of other relevant results