Abstract

The Q-learning algorithm is a popular reinforcement learning method for finite-state, finite-action, fully observed Markov decision processes (MDPs). In this paper, we make two contributions: (i) we establish the convergence of a Q-learning algorithm for partially observed Markov decision processes (POMDPs) that uses a finite history of past observations and control actions, and we show that the limiting fixed-point equation gives an optimal solution for an approximate belief-MDP. We then provide bounds on the performance of the policy obtained from the limiting Q-values relative to the performance of the optimal policy for the POMDP, and we present explicit performance guarantees using recent results on filter stability in controlled POMDPs. (ii) We apply these results to fully observed MDPs with continuous state spaces and establish the near-optimality of learned policies via quantization of the state space, where the quantization is viewed as a measurement channel leading to a POMDP model and a history window of unit size is used. In particular, we show that Q-learning, with its convergence and near-optimality properties, is applicable to continuous-space MDPs when the state space is quantized.
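The sketch below is a minimal illustration (not the paper's implementation) of Q-learning run on a finite-memory information state built from the last few observations and actions, as described in contribution (i). The environment interface (`env.reset()` returning an observation, `env.step(u)` returning an observation and a reward), the parameter names, and the 1/N step-size schedule are illustrative assumptions; the quantized continuous-state case of contribution (ii) corresponds to `window=1` with the observation taken to be the quantized state.

```python
import numpy as np
from collections import defaultdict, deque

def finite_window_q_learning(env, n_actions, window=2, gamma=0.95,
                             steps=200_000, eps=0.1, seed=0):
    """Q-learning on the finite-memory state z_t built from the last
    `window` observations and the last `window - 1` actions.

    `env` is a hypothetical POMDP interface: reset() -> y0,
    step(u) -> (y, r); observations must be hashable (finite).
    """
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))  # visit counts for step sizes

    obs_hist = deque(maxlen=window)       # last `window` observations
    act_hist = deque(maxlen=window - 1)   # last `window - 1` actions

    y = env.reset()
    obs_hist.append(y)
    z = (tuple(obs_hist), tuple(act_hist))

    for _ in range(steps):
        # epsilon-greedy exploration over the finite-memory state z
        if rng.random() < eps:
            u = int(rng.integers(n_actions))
        else:
            u = int(np.argmax(Q[z]))

        y_next, r = env.step(u)
        obs_hist.append(y_next)
        act_hist.append(u)
        z_next = (tuple(obs_hist), tuple(act_hist))

        # standard Q-learning update, treating z as the state of an
        # approximate (finite-memory) MDP
        counts[z][u] += 1
        alpha = 1.0 / counts[z][u]
        Q[z][u] += alpha * (r + gamma * np.max(Q[z_next]) - Q[z][u])

        z = z_next

    return Q
```

The greedy policy with respect to the returned table, u(z) = argmax over a of Q[z][a], is the learned finite-memory policy whose performance relative to the optimal POMDP policy is what the paper's bounds address.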
