Abstract

We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large. Analysis of the limit differential equation shows that it has a unique stationary solution that is the solution of the Bellman equation, thus giving the optimal control for the problem. In addition, we study the convergence of the limit differential equation to the stationary solution. As a by-product of our analysis, we obtain the limiting behavior of single-layer neural networks when trained on independent and identically distributed data with stochastic gradient descent under the widely used Xavier initialization.
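
To make the setting of the last sentence concrete, the following is a minimal sketch (not taken from the paper) of a single-hidden-layer network under Xavier (Glorot) initialization trained by plain stochastic gradient descent on i.i.d. regression data. The width N, the tanh activation, the target function, and the learning rate are illustrative assumptions, not the paper's exact specification.

    # Sketch only: single-hidden-layer network, Xavier init, SGD on i.i.d. data.
    # All numerical choices below (N, d, lr, target) are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    N, d = 1000, 3                              # hidden units, input dimension
    # Xavier (Glorot) initialization: variance proportional to 1 / fan_in.
    W = rng.normal(0.0, np.sqrt(1.0 / d), size=(N, d))
    c = rng.normal(0.0, np.sqrt(1.0 / N), size=N)

    def f(x):
        """Single-layer network f(x) = sum_i c_i * tanh(W_i . x)."""
        return float(c @ np.tanh(W @ x))

    lr = 0.1
    for step in range(10_000):                  # one i.i.d. sample per SGD step
        x = rng.normal(size=d)
        y = np.sin(x.sum())                     # illustrative target function
        h = np.tanh(W @ x)
        err = float(c @ h) - y                  # residual of the squared loss
        grad_c = err * h                        # d/dc of 0.5 * err**2
        grad_W = err * np.outer(c * (1.0 - h**2), x)  # d/dW of 0.5 * err**2
        c -= lr * grad_c
        W -= lr * grad_W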

Highlights

  • Reinforcement learning with neural networks has had a number of recent successes, including learning to play video games (Mnih et al 2013, 2015), mastering the game of Go (Silver et al 2017), and robotics (Kober and Peters 2012)

  • We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large

  • We prove that the Q-network converges to the solution of a random ordinary differential equation (ODE)

Summary

Introduction

Reinforcement learning with neural networks (frequently called “deep reinforcement learning”) has had a number of recent successes, including learning to play video games (Mnih et al 2013, 2015), mastering the game of Go (Silver et al 2017), and robotics (Kober and Peters 2012). The presence of a neural network in the Q-learning algorithm introduces technical challenges; as a consequence, in the infinite time horizon case we are able to prove convergence of the limiting ODE to the stationary solution only for small values of the discount factor. The situation is somewhat different in the finite time horizon case, in which we can prove that the limit ODE converges, for all values of the discount factor, to a global minimum that is the solution of the associated Bellman equation. In addition to characterizing the limiting behavior of the neural network as the number of hidden units and the number of stochastic gradient descent steps grow to infinity, we show that in the limit the neural network converges to a global minimum with zero training loss (see Section 4).
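
As a point of reference for the algorithm discussed above, here is a hedged sketch of the standard semi-gradient Q-learning update with a single-hidden-layer network approximating Q(s, a). The toy random MDP, the uniform exploration policy, the network width, and the step size are illustrative assumptions rather than the paper's exact setting.

    # Sketch only: tabular toy MDP, single-hidden-layer Q-network, semi-gradient
    # Q-learning toward the Bellman target y = r + gamma * max_a' Q(s', a').
    import numpy as np

    rng = np.random.default_rng(1)

    n_states, n_actions, N = 5, 2, 200          # toy MDP sizes, hidden width
    gamma, lr = 0.9, 0.05                       # discount factor, step size

    d = n_states + n_actions                    # one-hot (state, action) input
    # Xavier-style initialization of the single hidden layer.
    W = rng.normal(0.0, np.sqrt(1.0 / d), size=(N, d))
    c = rng.normal(0.0, np.sqrt(1.0 / N), size=N)

    def feat(s, a):
        x = np.zeros(d)
        x[s] = 1.0
        x[n_states + a] = 1.0
        return x

    def Q(s, a):
        return float(c @ np.tanh(W @ feat(s, a)))

    # Illustrative random transition probabilities and rewards.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.normal(size=(n_states, n_actions))

    s = 0
    for step in range(20_000):
        a = int(rng.integers(n_actions))                    # uniform exploration
        s_next = int(rng.choice(n_states, p=P[s, a]))
        target = R[s, a] + gamma * max(Q(s_next, b) for b in range(n_actions))
        x = feat(s, a)
        h = np.tanh(W @ x)
        err = float(c @ h) - target                         # TD error (negated)
        # Semi-gradient step: the Bellman target is treated as a constant.
        grad_c = err * h
        grad_W = np.outer(err * c * (1.0 - h**2), x)
        c -= lr * grad_c
        W -= lr * grad_W
        s = s_next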

Q-Learning Algorithm
The Finite Time Horizon Setting
Reinforcement Learning with Neural Network Approximation
Main Results
A Special Case
Proof of Convergence in Infinite Time Horizon Case
Identification of the Limit
Proof of Convergence
Analysis of the Limit Equation
Proof of Convergence in Finite Time Horizon Case
Proof That A Is Positive Definite
Conclusion
