The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima

Yu Feng, Yuhai Tu

doi:10.1073/pnas.2015617118

Abstract

Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (also known as the Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm. The effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
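
As a rough illustration of the two quantities the abstract compares, the sketch below runs minibatch SGD on a toy linear regression, applies PCA to the recorded weights, and measures a flatness value along each PCA direction. Everything specific here is an assumption made for the example and is not taken from the abstract: the toy data, the hyperparameters, the helper names (`run_sgd`, `pca_variances`, `flatness`), and in particular the flatness definition, which is taken to be the width of the interval along a direction within which the full-batch loss stays below e times its value at the mean weights. A convex toy model like this one is not expected to reproduce the reported inverse relation; it only shows how the per-direction variance and flatness could be computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical stand-in for a real training set).
n, d = 512, 5
X = rng.normal(size=(n, d)) * np.array([2.0, 1.5, 1.0, 0.75, 0.5])
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def loss(w):
    """Full-batch mean squared error (with a factor 1/2)."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, idx):
    """Gradient of the minibatch loss for the samples in idx."""
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

def run_sgd(steps=6000, batch=16, lr=0.02, burn_in=2000):
    """Plain minibatch SGD; returns the weights recorded after burn-in."""
    w = np.zeros(d)
    traj = []
    for t in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        w = w - lr * grad(w, idx)
        if t >= burn_in:
            traj.append(w.copy())
    return np.array(traj)

def pca_variances(traj):
    """Eigen-decompose the weight covariance; largest variance first."""
    centered = traj - traj.mean(axis=0)
    cov = centered.T @ centered / (len(traj) - 1)
    var, vecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(var)[::-1]
    return var[order], vecs[:, order]

def flatness(w0, direction, factor=np.e, max_step=10.0, num=4001):
    """Assumed flatness measure: width of the region along `direction`
    where the full-batch loss is at most factor * loss(w0)."""
    s = np.linspace(-max_step, max_step, num)
    W = w0 + s[:, None] * direction
    losses = 0.5 * np.mean((W @ X.T - y) ** 2, axis=1)
    below = s[losses <= factor * loss(w0)]
    return below.max() - below.min()

traj = run_sgd()
w_mean = traj.mean(axis=0)
var, vecs = pca_variances(traj)
flat = np.array([flatness(w_mean, vecs[:, i]) for i in range(d)])

for i in range(d):
    print(f"PCA direction {i}: variance={var[i]:.3e}, flatness={flat[i]:.3f}")
```

Plotting `var` against `flat` on log-log axes for a trained neural network, rather than this toy model, would be the natural way to inspect the kind of relation the abstract describes.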

Journal: Proceedings of the National Academy of Sciences of the United States of America
Publication Date: Feb 22, 2021
Citations: 30

Similar Papers

COVID-19 Fake News Detection System
Ruchika Malhotra ... Anushree Mahur
27 Jan 2022

To regularize or not: Revisiting SGD with simple algorithms and experimental studies
Wenwu He ... Yang Liu
Expert systems with applications | VOL. 112
15 Jun 2018

Differentially private SGD with non-smooth losses
Puyu Wang ... Yunwen Lei
Applied and computational harmonic analysis | VOL. 56
01 Jan 2021

The Improved Stochastic Fractional Order Gradient Descent Algorithm
Yang Yang ... Yusen Hu
Fractal and Fractional | VOL. 7
18 Aug 2023