Abstract

Temporal difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, because of the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparameterization of neural networks, which also plays a vital role in the empirical success of neural TD. We establish the theory for two-layer neural networks in the main paper and extend it to multilayer neural networks in the appendix. Beyond policy evaluation, we establish the global convergence of neural (soft) Q learning. Funding: Z. Yang acknowledges the Theory of Reinforcement Learning program at Simons Institute. J. D. Lee acknowledges support of the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, ONR Young Investigator Award, and NSF-CAREER under award #2144994. Z. Wang acknowledges National Science Foundation [Awards 2048075, 2008827, 2015568, 1934931], Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.
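For concreteness, the following is a standard formulation of the objects named in the abstract; the notation here is generic and illustrative, not necessarily that of the paper. With a two-layer network of width m serving as the value function approximator, the TD update and the mean-squared projected Bellman error (MSPBE) take the form

\[
V_\theta(s) \;=\; \frac{1}{\sqrt{m}} \sum_{r=1}^{m} b_r \,\sigma\!\big(w_r^\top s\big),
\qquad \theta = (w_1, \dots, w_m),
\]
\[
\theta_{t+1} \;=\; \theta_t + \eta\,\big(r_t + \gamma\, V_{\theta_t}(s_{t+1}) - V_{\theta_t}(s_t)\big)\,\nabla_\theta V_{\theta_t}(s_t),
\]
\[
\mathrm{MSPBE}(\theta) \;=\; \big\|\, \Pi_{\mathcal{F}}\, \mathcal{T}^\pi V_\theta - V_\theta \,\big\|_{\mu}^2,
\]

where \(\sigma\) is the activation function, \(\eta\) the step size, \(\gamma\) the discount factor, \(\mathcal{T}^\pi\) the Bellman operator of the evaluated policy \(\pi\), \(\Pi_{\mathcal{F}}\) the projection onto the function class, and \(\mu\) the stationary state distribution. Overparameterization corresponds to taking the width m large.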
