Abstract

Stochastic gradient descent (SGD) is of fundamental importance in deep learning. Despite its simplicity, explaining its efficacy remains challenging. Conventionally, the success of SGD is ascribed to the stochastic gradient noise (SGN) incurred in the training process. Based on this consensus, SGD is frequently treated and analyzed as the Euler-Maruyama discretization of stochastic differential equations (SDEs) driven by either Brownian or Lévy stable motion. In this study, we argue that SGN is neither Gaussian nor Lévy stable. Instead, inspired by the short-range correlation emerging in the SGN series, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM). Under this view, the differing convergence behavior of SGD dynamics is well-grounded. Moreover, we derive an approximation of the first passage time of an SDE driven by FBM. The result suggests a lower escaping rate for a larger Hurst parameter, and thus SGD stays longer in flat minima. This is consistent with the well-known phenomenon that SGD favors flat minima that generalize well. Extensive experiments are conducted to validate our conjecture, and they demonstrate that short-range memory effects persist across various model architectures, datasets, and training strategies. Our study opens up a new perspective and may contribute to a better understanding of SGD.
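To make the proposed viewpoint concrete, the following is a minimal sketch, not the authors' implementation, of a one-dimensional SGD-like recursion whose noise is an increment of FBM (fractional Gaussian noise) rather than white Gaussian noise. Everything named here is an assumption made for illustration: the function names, the quadratic toy loss in the usage note, the noise scale `sigma`, and the Cholesky sampler (chosen for clarity rather than speed). The step-size scaling `lr ** hurst` follows from the self-similarity of FBM, whose increments over an interval of length `lr` have standard deviation `lr ** hurst`.

```python
import math
import random


def fgn_autocov(k, hurst):
    """Autocovariance of unit-step fractional Gaussian noise at lag k."""
    k = abs(k)
    h2 = 2.0 * hurst
    return 0.5 * ((k + 1) ** h2 - 2.0 * k ** h2 + abs(k - 1) ** h2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_fgn(n, hurst, rng):
    """Draw n correlated unit-variance FBM increments (O(n^3), small n only)."""
    cov = [[fgn_autocov(i - j, hurst) for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


def sgd_fbm_path(grad, x0, lr, sigma, hurst, n_steps, seed=0):
    """Euler-type discretization of dx = -grad(x) dt + sigma dB_H(t).

    grad, sigma, and the 1-D setting are illustrative assumptions.
    """
    rng = random.Random(seed)
    noise = sample_fgn(n_steps, hurst, rng)
    xs = [x0]
    for xi in noise:
        x = xs[-1]
        # FBM increment over a step of length lr scales as lr ** hurst.
        xs.append(x - lr * grad(x) + sigma * (lr ** hurst) * xi)
    return xs
```

For instance, `sgd_fbm_path(lambda x: x, x0=1.0, lr=0.01, sigma=0.1, hurst=0.7, n_steps=100)` simulates the recursion on the toy loss x²/2. With `hurst=0.5` the lag-one autocovariance is zero and the recursion reduces to the usual Euler-Maruyama scheme with independent Gaussian noise; values above 0.5 give positively correlated increments and values below 0.5 give negatively correlated ones.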
