Abstract

The expected reward in a linear stochastic bandit model is an unknown linear function of the chosen decision vector. In this paper, we consider the case where the expected reward is an unknown linear function of a projection of the decision vector onto a subspace. We call this the projection reward. Unlike the classical linear bandit problem, we assume that the projection reward is unobservable. Instead, the observed "reward" at each time step is the projection reward corrupted by another linear function of the decision vector projected onto a subspace orthogonal to the first. Such a model is useful in recommendation applications where the observed reward is corrupted by each individual's biases. In the case where there are finitely many decision vectors, we develop a strategy to achieve O(log T) regret, where T is the number of time steps. In the case where the decision vector is chosen from an infinite compact set, our strategy achieves O(T^{2/3}(log T)^{1/2}) regret. Simulations verify the efficiency of our strategy.
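
As a rough illustration of the observation model described in the abstract, the sketch below simulates one round: the decision vector's projection onto a "reward" subspace determines the unobservable projection reward, while the projection onto the orthogonal subspace contributes the corrupting bias. The dimensions, parameter values, and additive Gaussian noise are our own assumptions for illustration, not details taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    d, k = 5, 2                        # ambient dimension, dimension of the reward subspace (assumed)
    # Orthonormal basis U for the reward subspace; V spans the orthogonal (corruption) subspace.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    U, V = Q[:, :k], Q[:, k:]

    theta = rng.standard_normal(k)     # unknown parameter of the projection reward (hypothetical)
    eta = rng.standard_normal(d - k)   # unknown parameter of the corrupting component (hypothetical)

    def observe(x, noise_std=0.1):
        """Return (observed reward, unobservable projection reward) for decision vector x."""
        projection_reward = theta @ (U.T @ x)   # depends only on the projection onto span(U)
        corruption = eta @ (V.T @ x)            # bias from the projection onto the orthogonal subspace
        noise = noise_std * rng.standard_normal()
        return projection_reward + corruption + noise, projection_reward

    x = rng.standard_normal(d)         # an example decision vector
    y, r = observe(x)
    print(f"observed reward {y:.3f}, true projection reward {r:.3f}")

Under this reading, the learner only ever sees y, yet regret is measured against the best achievable projection reward r, which is what makes the problem harder than a classical linear bandit.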
