Abstract
A longstanding debate surrounds the related hypotheses that low-curvature minima generalize better, and that stochastic gradient descent (SGD) discourages curvature. We offer a more complete and nuanced view in support of both hypotheses. First, we show that curvature harms test performance through two new mechanisms, the shift-curvature and bias-curvature, in addition to a known parameter-covariance mechanism. The shift refers to the difference between train and test local minima, and the bias and covariance are those of the parameter distribution. These three curvature-mediated contributions to test performance are reparametrization-invariant even though curvature itself is not. Although the shift is unknown at training time, the shift-curvature, as well as the other mechanisms, can still be mitigated by minimizing overall curvature. Second, we derive a new, explicit SGD steady-state distribution showing that SGD optimizes an effective potential related to but different from train loss, and that SGD noise mediates a trade-off between low-loss and low-curvature regions of this effective potential. Third, combining our test performance analysis with the SGD steady state shows that for small SGD noise, the shift-curvature is the dominant mechanism of the three. Our experiments demonstrate the significant impact of shift-curvature on test loss, and further explore the relationship between SGD noise and curvature.
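To make the three curvature-mediated mechanisms concrete, the following is a hedged sketch of how such a decomposition can arise from a second-order expansion of the test loss. The notation (shift $s$, bias $b$, covariance $\Sigma$, Hessian $H$) and the exact form of the expansion are illustrative assumptions on our part, not necessarily the paper's precise statement.

```latex
% Illustrative sketch (our notation, not necessarily the paper's exact result):
% a second-order expansion showing how shift, bias, and covariance each
% couple to curvature in the expected test loss.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Let $w^\ast_{\text{test}}$ be the test local minimum, $H$ the test-loss Hessian
(curvature) there, and suppose SGD yields parameters $w$ with mean $\mu$ and
covariance $\Sigma$. Define the shift $s = w^\ast_{\text{train}} - w^\ast_{\text{test}}$
and the bias $b = \mu - w^\ast_{\text{train}}$. Expanding the test loss to second
order around $w^\ast_{\text{test}}$ and averaging over $w$,
\begin{align*}
\mathbb{E}\big[L_{\text{test}}(w)\big]
  &\approx L_{\text{test}}(w^\ast_{\text{test}})
   + \tfrac{1}{2}\,\mathbb{E}\big[(w - w^\ast_{\text{test}})^\top H\,(w - w^\ast_{\text{test}})\big] \\
  &= L_{\text{test}}(w^\ast_{\text{test}})
   + \tfrac{1}{2}\,(s + b)^\top H\,(s + b)
   + \tfrac{1}{2}\,\operatorname{tr}(H \Sigma),
\end{align*}
so a shift--curvature term ($s^\top H s$), bias--curvature terms
($b^\top H b$ and the cross term $2\,s^\top H b$), and a
covariance--curvature term ($\operatorname{tr}(H\Sigma)$) all penalize
high curvature $H$, which is why reducing overall curvature helps even
though the shift itself is unknown at training time.
\end{document}
```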