Abstract

We consider the two-armed bandit problem as applied to data processing when there are two alternative processing methods available with different, a priori unknown, efficiencies. One should determine the most effective method and ensure its predominant application. The total number of data items, which is interpreted as a control horizon, is assumed to have an a priori known probability distribution. The problem is considered in the minimax (robust) setting. According to the main theorem of the theory of games, the minimax risk and minimax strategy are sought as Bayesian ones corresponding to the worst-case prior distribution. We describe the properties of the worst-case prior and present a recursive Bellman-type equation for the determination of both the minimax strategy and the minimax risk. Numerical results illustrating the proposed algorithm are given. The algorithm can be applied to the optimization of parallel data processing when the number of processed data items is not definitely known in advance.

Highlights

  • We consider the two-armed bandit problem, which is well known as the problem of expedient behavior in a random environment and as the problem of adaptive control, in the following setting

  • Let ξ_n, n = 1, . . . , N, be a controlled random process whose values are interpreted as incomes, depend only on the currently chosen actions y_n, and have normal probability distribution densities f(x | m_ℓ) = (2π)^(−1/2) exp(−(x − m_ℓ)²/2) if y_n = ℓ (ℓ = 1, 2)

  • A control strategy σ at the point of time n assigns a random choice of the action y_n depending on the current history of the process, i.e. the replies x^(n−1) = (x_1, . . . , x_{n−1}) to the applied actions y^(n−1) = (y_1, . . . , y_{n−1}): σ_ℓ(y^(n−1), x^(n−1)) = Pr(y_n = ℓ | y^(n−1), x^(n−1)); a minimal simulation sketch of this setting is given after this list
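To make the setting concrete, here is a minimal Monte Carlo sketch of the model described above. All names are hypothetical, and the "forced trials, then greedy" rule is only an illustrative strategy of the stated form σ_ℓ(y^(n−1), x^(n−1)), not the paper's minimax strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, strategy, N):
    """Run one trajectory of the normal two-armed bandit.

    theta = (m1, m2): the unknown means; the income for action l is N(m_l, 1).
    strategy(actions, rewards) -> probability of choosing action 1 at time n.
    Returns the total income over the horizon N.
    """
    actions, rewards, total = [], [], 0.0
    for n in range(N):
        p1 = strategy(actions, rewards)        # sigma_1(y^{n-1}, x^{n-1})
        y = 1 if rng.random() < p1 else 2
        x = rng.normal(theta[y - 1], 1.0)      # income with density f(x | m_l)
        actions.append(y)
        rewards.append(x)
        total += x
    return total

def greedy(actions, rewards, forced=5):
    """Toy strategy: try each arm `forced` times, then follow the better sample mean."""
    for l in (1, 2):
        if actions.count(l) < forced:
            return 1.0 if l == 1 else 0.0
    m = [np.mean([x for y, x in zip(actions, rewards) if y == l]) for l in (1, 2)]
    return 1.0 if m[0] >= m[1] else 0.0

theta, N, runs = (0.0, 0.4), 100, 2000
income = np.mean([simulate(theta, greedy, N) for _ in range(runs)])
loss = N * max(theta) - income   # Monte Carlo estimate of the expected losses
print(f"estimated loss L_N(sigma, theta): {loss:.2f}")
```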


Summary

Introduction

We consider the two-armed bandit problem (see, e.g., [1], [2]), which is well known as the problem of expedient behavior in a random environment (see, e.g., [3], [4]) and as the problem of adaptive control (see, e.g., [5], [6]), in the following setting. Let ξ_n, n = 1, . . . , N, be a controlled random process whose values are interpreted as incomes, depend only on the currently chosen actions y_n (y_n ∈ {1, 2}), and have normal probability distribution densities f(x | m_ℓ) = (2π)^(−1/2) exp(−(x − m_ℓ)²/2) if y_n = ℓ (ℓ = 1, 2). The normal two-armed bandit can thus be described by a vector parameter θ = (m_1, m_2). A control strategy σ at the point of time n assigns a random choice of the action y_n depending on the current history of the process, i.e. the replies x^(n−1) = (x_1, . . . , x_{n−1}) to the applied actions y^(n−1) = (y_1, . . . , y_{n−1}): σ_ℓ(y^(n−1), x^(n−1)) = Pr(y_n = ℓ | y^(n−1), x^(n−1)). The loss function L_N(σ, θ) = N max(m_1, m_2) − E_{σ,θ}(∑_{n=1}^{N} ξ_n) describes the expected losses of total income with respect to its maximal possible value due to incomplete information. According to the minimax approach, the maximal value of the loss function on the set of parameters Θ should be minimized on the set of strategies Σ.
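The recursive Bellman-type equation itself is not reproduced on this page. As a purely illustrative sketch, under assumptions of my own rather than the paper's construction (a two-point parameter set with swapped means, a fixed rather than random horizon N, and grids for the posterior and the observation), the Bayesian expected income can be computed by a backward recursion over the posterior probability p:

```python
import numpy as np

# Assumed (illustrative) two-point parameter set Theta = {(m1, m2), (m2, m1)};
# the posterior is then one number p = Pr(theta = (m1, m2) | history).
m1, m2, N = 0.0, 0.4, 50

def f(x, m):
    """Normal income density f(x | m) = (2*pi)^(-1/2) * exp(-(x - m)^2 / 2)."""
    return np.exp(-(x - m) ** 2 / 2) / np.sqrt(2 * np.pi)

p_grid = np.linspace(0.0, 1.0, 201)                           # posterior grid
x_grid = np.linspace(min(m1, m2) - 6, max(m1, m2) + 6, 401)   # observation grid
dx = x_grid[1] - x_grid[0]
# Arm l has mean means[l][0] under (m1, m2) and means[l][1] under (m2, m1).
means = {1: (m1, m2), 2: (m2, m1)}
dens = {l: (f(x_grid, means[l][0]), f(x_grid, means[l][1])) for l in (1, 2)}

V = np.zeros_like(p_grid)              # V_0(p) = 0: no steps remain
for n in range(N):                     # backward induction over the horizon
    V_new = np.empty_like(V)
    for i, p in enumerate(p_grid):
        best = -np.inf
        for l in (1, 2):
            fa, fb = dens[l]
            mix = p * fa + (1 - p) * fb                  # predictive density of x
            post = p * fa / np.maximum(mix, 1e-300)      # Bayes update p'(x)
            cont = np.trapz(np.interp(post, p_grid, V) * mix, dx=dx)
            a, b = means[l]
            best = max(best, p * a + (1 - p) * b + cont)  # income now + later
        V_new[i] = best
    V = V_new

# Bayesian loss at the symmetric prior p = 1/2 (a natural worst-case candidate
# on this two-point set): N * max(m1, m2) - V_N(1/2).
bayes_loss = N * max(m1, m2) - np.interp(0.5, p_grid, V)
print(f"Bayesian loss at p = 1/2: {bayes_loss:.3f}")
```

The paper's actual equation additionally handles the random horizon with its a priori known distribution and characterizes the worst-case prior; maximizing the Bayesian loss over priors on the parameter set would approximate the minimax risk described in the abstract.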
