Abstract

Popular first-order stochastic optimization methods for deep neural networks (DNNs) are usually either accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum) or adaptive step-size methods (e.g. Adam/AdaMax, AdaBelief). In many contexts, including image classification with DNNs, adaptive methods tend to generalize more poorly than SGD, i.e. they get stuck in non-robust local minima; however, SGD typically converges more slowly. We analyze possible reasons for this behavior by modeling gradient updates as vectors of random variables and comparing them against probabilistic bounds to identify "meaningful" updates. Through experiments, we observe that only layers close to the output have "definitely non-random" update behavior. In the future, the tools developed here may be useful for rigorously quantifying and analyzing intuitions about why some optimizers and particular DNN architectures perform better than others.

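The abstract does not spell out the exact statistic or bound used to separate "meaningful" from random-looking updates, so the sketch below only illustrates the general idea under an assumed null model: if a layer's per-step updates were zero-mean and mutually independent (a random walk), the squared norm of the accumulated update would be close to the sum of the per-step squared norms. The function name `directedness_ratio` and the threshold-free ratio test are hypothetical choices for illustration, not the paper's method.

```python
import numpy as np

def directedness_ratio(updates):
    """Compare one layer's accumulated update against a 'random walk' null.

    updates: array-like of shape (T, d), one flattened update vector per step.
    Under the null hypothesis that the updates are zero-mean and mutually
    independent, E||sum_t u_t||^2 = sum_t E||u_t||^2, so this ratio
    concentrates near 1. A value well above 1 suggests the updates share a
    consistent direction, i.e. the layer's update behavior is non-random.
    """
    U = np.asarray(updates, dtype=float)          # shape (T, d)
    accumulated_sq = np.sum(U.sum(axis=0) ** 2)   # ||sum_t u_t||^2
    stepwise_sq = np.sum(U ** 2)                  # sum_t ||u_t||^2
    return accumulated_sq / stepwise_sq

# Toy demonstration on synthetic "layer updates":
rng = np.random.default_rng(0)
T, d = 200, 1000
noise_only = rng.normal(size=(T, d))                 # pure noise: random walk
drifting = 0.2 * np.ones(d) + rng.normal(size=(T, d))  # noise plus a shared drift

print(directedness_ratio(noise_only))  # ~1: consistent with the random null
print(directedness_ratio(drifting))    # >>1: flagged as non-random
```

In practice one would collect the per-step updates of each layer during training and compute such a statistic per layer; the paper's observation would then correspond to only the layers near the output exceeding the probabilistic bound.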