Abstract

This paper presents an algorithm and regret analysis for the restless hidden Markov bandit problem with linear rewards. In this problem the reward received by the decision maker is a random linear function that depends on the selected arm and a hidden state. In contrast to previous works on Markovian bandits, we do not assume that the decision maker receives information regarding the state of the system; it can only infer or estimate the state based on its actions and the received rewards. Additionally, the decision maker is assumed to know in advance that the reward is a random linear function of the selected arm and the hidden state. However, it does not know in advance the probability distributions associated with these hidden states; we therefore call this side information structural side information. Surprisingly, we can still maintain logarithmic regret in the case of a polyhedral action set. Furthermore, we show that the structural side information leads to an expected regret that does not depend on the number of extreme points of the action space.
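
As a rough illustration of the reward model described above, the sketch below simulates a hidden Markov chain whose state, together with the selected arm, determines a noisy linear reward; the decision maker observes only the reward, never the state. All parameters here (the transition matrix `P`, per-state parameter vectors `theta`, arm vectors `arms`, and the noise level) are hypothetical placeholders for illustration, not values or an algorithm from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical model parameters (illustration only) ---
n_states = 2                       # number of hidden Markov states
P = np.array([[0.9, 0.1],          # state transition matrix; unknown
              [0.2, 0.8]])         # to the decision maker
theta = np.array([[1.0, -0.5],     # per-state reward parameter vectors
                  [-0.3, 0.7]])
arms = np.array([[1.0, 0.0],       # extreme points of a polyhedral
                 [0.0, 1.0],       # action set (known structure)
                 [0.5, 0.5]])

def step(state, arm_idx, noise_std=0.1):
    """Evolve the hidden chain (restless: it moves regardless of the
    action) and return the noisy linear reward for the chosen arm.
    Only the reward is revealed to the decision maker."""
    next_state = rng.choice(n_states, p=P[state])
    reward = arms[arm_idx] @ theta[next_state] + rng.normal(0.0, noise_std)
    return next_state, reward

state = 0
for t in range(5):
    arm_idx = rng.integers(len(arms))   # placeholder policy, not the
    state, reward = step(state, arm_idx)  # paper's algorithm
    print(f"t={t} arm={arm_idx} reward={reward:.3f}")
```

In this sketch the "structural side information" corresponds to the decision maker knowing the linear form `arms[arm_idx] @ theta[state]` and the arm vectors, while the transition matrix and the state-dependent parameters remain unknown.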
