Abstract

We consider the problem of learning a policy for a Markov decision process that is consistent with observed data on the state-action pairs visited under that policy. We parameterize the policy using features associated with the state-action pairs. The features can be handcrafted or defined using kernel functions in a reproducing kernel Hilbert space. In either case, the set of features can be large, and only a small, unknown subset may be needed to fit a specific policy to the data. The parameters of such a policy are recovered using $\ell_1$-regularized logistic regression. We establish bounds on the difference between the average reward of the estimated and the unknown original policies (regret) in terms of the generalization error and the ergodic coefficient of the underlying Markov chain. To that end, we combine sample complexity theory and sensitivity analysis of the stationary distribution of Markov chains. Our analysis suggests that to achieve regret of order $O(\sqrt{\epsilon})$, it suffices to use a training sample of size $\Omega(\log n \cdot \mathrm{poly}(1/\epsilon))$, where $n$ is the number of features. We demonstrate the effectiveness of our method on a synthetic robot navigation example.
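The sketch below illustrates the core estimation step described above: fitting a sparse policy from observed state-action pairs via $\ell_1$-regularized logistic regression. It is not the paper's code; the binary action space, Gaussian features, sparsity level, and regularization strength are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): recover a sparse
# policy's parameters with l1-regularized logistic regression, assuming
# binary actions and generic state features. Dimensions and constants
# below are hypothetical choices for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_features, n_samples, sparsity = 200, 1000, 5

# Unknown policy: logistic in a small, unknown subset of the features.
theta_true = np.zeros(n_features)
support = rng.choice(n_features, size=sparsity, replace=False)
theta_true[support] = rng.normal(scale=2.0, size=sparsity)

# Observed state-action pairs: feature vectors of visited states and
# actions drawn from pi(a=1 | s) = sigmoid(phi(s)^T theta).
Phi = rng.normal(size=(n_samples, n_features))
p = 1.0 / (1.0 + np.exp(-Phi @ theta_true))
actions = rng.binomial(1, p)

# l1-regularized logistic regression yields a sparse estimate of theta
# from the (feature, action) pairs.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(Phi, actions)
theta_hat = clf.coef_.ravel()

print("nonzeros in estimate:", np.count_nonzero(theta_hat))
print("true support recovered:",
      set(support) <= set(np.flatnonzero(theta_hat)))
```

In this toy setting, the number of observed state-action pairs needed for accurate recovery grows only logarithmically in the total number of features, in line with the sample-size scaling stated in the abstract.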
