Abstract
We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large. Analysis of the limit differential equation shows that it has a unique stationary solution that is the solution of the Bellman equation, thus giving the optimal control for the problem. In addition, we study the convergence of the limit differential equation to the stationary solution. As a by-product of our analysis, we obtain the limiting behavior of single-layer neural networks when trained on independent and identically distributed data with stochastic gradient descent under the widely used Xavier initialization.
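The training setup described above (a single-layer Q-network with Xavier-style scaling, updated by stochastic gradient steps on the squared Bellman error) can be sketched as follows. This is a minimal illustrative sketch in numpy, not the paper's implementation; the class and method names, the tanh activation, and all hyperparameters are assumptions made here for concreteness.

```python
import numpy as np

class SingleLayerQNet:
    """Hypothetical single-layer Q-network with N hidden units and
    Xavier-style O(1/sqrt(fan_in)) initialization plus 1/sqrt(N) output scaling."""

    def __init__(self, state_dim, num_actions, N, rng):
        self.N = N
        self.W = rng.normal(0.0, 1.0 / np.sqrt(state_dim), size=(N, state_dim))
        self.b = np.zeros(N)
        self.c = rng.normal(0.0, 1.0, size=(num_actions, N))

    def q_values(self, x):
        h = np.tanh(self.W @ x + self.b)        # hidden-layer activations
        return (self.c @ h) / np.sqrt(self.N)   # normalized output layer

    def sgd_q_learning_step(self, x, a, r, x_next, gamma, lr):
        """One stochastic (semi-)gradient Q-learning update on (x, a, r, x')."""
        target = r + gamma * np.max(self.q_values(x_next))  # bootstrapped target, held fixed
        h = np.tanh(self.W @ x + self.b)
        q_sa = (self.c[a] @ h) / np.sqrt(self.N)
        delta = q_sa - target                                # temporal-difference error
        # Gradients of 0.5 * delta**2 with respect to the network parameters.
        grad_pre = delta * self.c[a] / np.sqrt(self.N) * (1.0 - h ** 2)
        self.c[a] -= lr * delta * h / np.sqrt(self.N)
        self.W -= lr * np.outer(grad_pre, x)
        self.b -= lr * grad_pre
```

The 1/sqrt(N) output scaling is the point of contact with the Xavier-initialization regime studied in the paper: it keeps the network output of order one as the number of hidden units N grows, which is the scaling under which the large-N, many-steps limit is taken.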
Highlights
Reinforcement learning with neural networks has had a number of recent successes, including learning to play video games (Mnih et al., 2013, 2015), mastering the game of Go (Silver et al., 2017), and robotics (Kober and Peters, 2012).
We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large.
We prove that the Q-network converges to the solution of a random ordinary differential equation (ODE).
Summary
Reinforcement learning with neural networks (frequently called "deep reinforcement learning") has had a number of recent successes, including learning to play video games (Mnih et al., 2013, 2015), mastering the game of Go (Silver et al., 2017), and robotics (Kober and Peters, 2012). The presence of a neural network in the Q-learning algorithm introduces technical challenges: in the infinite time horizon case, we can prove convergence of the limiting ODE to the stationary solution only for small values of the discount factor. The situation is different in the finite time horizon case, where we can prove that the limit ODE converges to a global minimum, which is the solution of the associated Bellman equation, for all values of the discount factor. In addition to characterizing the limiting behavior of the neural network as the number of hidden units and stochastic gradient descent steps grow to infinity, we show that in the limit the neural network converges to a global minimum with zero training loss (see Section 4).
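For reference, the Bellman optimality equation that characterizes the stationary solution can be written, in generic notation (the paper's precise state and action spaces may differ), as:

```latex
Q^{*}(x,a) \;=\; \mathbb{E}\!\left[\, r(x,a) \,+\, \gamma \max_{a'} Q^{*}(x',a') \;\middle|\; x,a \right],
```

where $\gamma$ is the discount factor and $x'$ is the next state reached from state $x$ under action $a$. A Q-function satisfying this fixed-point equation yields the optimal control by acting greedily with respect to it.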