Abstract
As training deep neural networks incurs substantial computational cost, speeding up convergence is of great practical importance. Nesterov's accelerated gradient (NAG) is one of the most popular accelerated optimizers in the deep learning community and often exhibits improved convergence over gradient descent (GD) in practice. However, theoretical investigations of NAG have mainly focused on the convex setting. Since the optimization landscape of a neural network is non-convex, little is known about the convergence and acceleration of NAG in this regime. Recently, several works have made progress toward understanding the convergence of NAG in training over-parameterized neural networks, where the number of parameters exceeds the number of training instances. Nonetheless, these studies are limited to two-layer neural networks and fall far short of explaining the remarkable success of NAG in optimizing deep neural networks. In this paper, we investigate the convergence of NAG in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets. In the over-parameterized regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under random Gaussian initialization. Our results show that NAG converges to the global minimum at a (1 - O(1/√κ))^t rate when the width is near-linear in the depth of the network, where t is the number of iterations and κ > 1 is a constant depending on the condition number of the feature matrix. Compared to the (1 - O(1/κ))^t rate of GD, NAG thus achieves an acceleration over GD. For deep linear ResNets, we apply the same analytical approach and obtain a similar convergence result, while the width requirement is independent of the depth. To the best of our knowledge, these are the first theoretical guarantees for the convergence and acceleration of NAG in training deep neural networks. Numerical results demonstrate the acceleration of NAG over GD in terms of the number of iterations. In addition, we conduct experiments to evaluate the effect of depth on the convergence rate of NAG, which validate our derived conditions on the width. We hope our results shed light on the optimization behavior of NAG for modern deep neural networks.
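To make the setting concrete, the following minimal Python/NumPy sketch trains a deep fully-connected linear network with both GD and NAG under random Gaussian initialization, mirroring the comparison described above. This is an illustrative toy, not the paper's implementation: the dimensions, learning rate, momentum parameter, and helper names (init_weights, loss_and_grads, train) are hypothetical choices for demonstration only.

```python
# Toy comparison of GD vs. NAG on a deep fully-connected *linear* network
# f(x) = W_L ... W_1 x with squared loss. All hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d, width, depth, n = 10, 64, 4, 100        # in/out dim, hidden width, #layers, #samples
X = rng.standard_normal((d, n))
W_star = rng.standard_normal((d, d)) / np.sqrt(d)
Y = W_star @ X                             # targets from a linear teacher

def init_weights(seed=1):
    # Random Gaussian initialization scaled by 1/sqrt(fan-in); same seed for both runs.
    g = np.random.default_rng(seed)
    dims = [d] + [width] * (depth - 1) + [d]
    return [g.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
            for i in range(depth)]

def loss_and_grads(Ws):
    # Forward pass: the network output is the product of the layer matrices applied to X.
    acts = [X]
    for W in Ws:
        acts.append(W @ acts[-1])
    R = acts[-1] - Y                       # residual
    loss = 0.5 * np.sum(R ** 2) / n
    # Backward pass through the linear layers.
    grads, G = [], R / n
    for i in reversed(range(depth)):
        grads.append(G @ acts[i].T)        # gradient w.r.t. Ws[i]
        G = Ws[i].T @ G                    # gradient w.r.t. the i-th activation
    return loss, grads[::-1]

def train(use_nag, lr=1e-2, beta=0.9, steps=300):
    Ws = init_weights()
    prev = [W.copy() for W in Ws]
    losses = []
    for _ in range(steps):
        # NAG evaluates the gradient at the extrapolated (look-ahead) point;
        # with use_nag=False this reduces to plain gradient descent.
        look = [W + beta * (W - P) for W, P in zip(Ws, prev)] if use_nag else Ws
        loss, grads = loss_and_grads(look)
        losses.append(loss)
        new_Ws = [W_l - lr * g for W_l, g in zip(look, grads)]
        prev, Ws = Ws, new_Ws
    return losses

gd = train(use_nag=False)
nag = train(use_nag=True)
print(f"loss after {len(gd)} iterations  GD: {gd[-1]:.3e}   NAG: {nag[-1]:.3e}")
```

Under these (arbitrary) settings, both runs start from the same initialization, so the printed losses give a rough, per-iteration comparison of the two optimizers in the spirit of the experiments the abstract describes.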