Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Dimitri P Bertsekas, Huizhen Yu

doi:10.1287/moor.1110.0532

Abstract

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively addresses the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.
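To make the idea of replacing linear policy evaluation with an optimal stopping problem more concrete, here is a minimal synchronous, deterministic sketch in Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not spell out the update rule; the form assumed here is Q(i,u) = sum_j p_ij(u) [ g(i,u) + alpha * min{ J(j), Q(j, mu(j)) } ], where J acts as the stopping cost and mu is the current policy. The function name, the array layout (P[u, i, j], g[u, i]), and the parameters m (inner value iterations) and iters (outer iterations) are all hypothetical choices made for this example.

```python
import numpy as np

def stopping_policy_iteration(P, g, alpha, m=5, iters=200):
    """Hypothetical sketch of policy iteration with a stopping-problem evaluation.

    P[u, i, j]: probability of moving from state i to j under control u.
    g[u, i]:    expected one-stage cost of control u at state i.
    alpha:      discount factor in (0, 1).
    m:          number of value iterations spent on each stopping problem.
    """
    n_u, n, _ = P.shape
    Q = np.zeros((n, n_u))          # Q-factor estimates, indexed [state, control]
    J = Q.min(axis=1)               # state-cost estimates, used as stopping costs
    states = np.arange(n)
    for _ in range(iters):
        mu = Q.argmin(axis=1)       # current policy: greedy with respect to Q
        for _ in range(m):          # inexact solution of the stopping problem
            cont = Q[states, mu]            # cost of continuing under mu
            target = np.minimum(J, cont)    # stop at cost J or continue
            Q = (g + alpha * (P @ target)).T
        J = Q.min(axis=1)           # policy improvement step
    return J, Q.argmin(axis=1), Q

# Small made-up example: 2 controls, 2 states.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
g = np.array([[1.0, 2.0],
              [1.5, 0.5]])
J, mu, Q = stopping_policy_iteration(P, g, alpha=0.9)
```

Under the assumed update, the minimum with J caps how much an inaccurate Q(j, mu(j)) can contribute to the backup, which is one way to read the "special contraction properties" mentioned in the abstract. Setting m to 1 makes each outer pass resemble a value iteration step, while larger m spends more effort evaluating the current policy.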

Journal: Mathematics of Operations Research
Publication Date: Feb 1, 2012
Citations: 42
License type: CC BY-NC-SA
