Abstract

Many adaptive gradient methods, such as Adagrad, Adadelta, RMSprop and Adam, have been successfully applied to train deep neural networks. These methods perform local optimization with an element-wise scaling of the learning rate based on past gradients. Although they can achieve a favorable training loss, several researchers have pointed out that their generalization capability tends to be poor compared with stochastic gradient descent (SGD) in many applications. These methods achieve a rapid initial decrease in training loss but fail to converge to an optimal solution because of unstable and extreme learning rates. In this paper, we investigate adaptive gradient methods and gain insights into the factors that may lead to the poor performance of Adam. To overcome them, we propose a bounded scheduling algorithm for Adam, which not only improves the generalization capability but also ensures convergence. To validate our claims, we carry out a series of experiments on image classification and language modeling tasks, using standard architectures such as ResNet, DenseNet, SENet and LSTM on typical datasets such as CIFAR-10, CIFAR-100 and Penn Treebank. Experimental results show that our method can eliminate the generalization gap between Adam and SGD while maintaining a relatively high convergence rate during training.
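The abstract refers to Adam's element-wise scaling of the learning rate and to a bounded scheduling scheme that constrains it. Below is a minimal sketch of that idea in NumPy, not the authors' exact algorithm: the function name bounded_adam_step, the hyperparameter values, and the particular bound schedule (bounds that tighten toward a single final_lr as training progresses) are illustrative assumptions.

```python
# Sketch of an Adam-style update whose element-wise step sizes are clipped into a
# scheduled band, in the spirit of the bounded scheduling idea described above.
# The schedule and hyperparameters are hypothetical, for illustration only.
import numpy as np

def bounded_adam_step(param, grad, m, v, t,
                      alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                      final_lr=0.1, gamma=1e-3):
    """One update of an Adam-like optimizer with clipped per-element step sizes."""
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Unclipped element-wise learning rate, as in plain Adam.
    step_size = alpha / (np.sqrt(v_hat) + eps)

    # Scheduled lower/upper bounds that tighten toward final_lr as t grows,
    # suppressing extreme per-element rates so the update behaves more like
    # SGD late in training (hypothetical schedule).
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    step_size = np.clip(step_size, lower, upper)

    param = param - step_size * m_hat
    return param, m, v
```

A training loop would call such a step once per parameter tensor, carrying m, v and the step counter t across iterations.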

Highlights

  • Deep neural networks (DNNs) [1] have achieved great success in many applications, such as image recognition [2], object detection [3], speech recognition [4,5], face recognition [6] and machine translation [7]

  • SGD with momentum (SGDM) has the slowest convergence speed on both the training and test sets, but its final test accuracy is higher than that of Adam and Adagrad, which indicates that its generalization capability is better than that of the adaptive gradient methods

Summary

Gradient Methods

Mingxing Tang 1, Zhen Huang 1,*, Yuan Yuan 2, Changjian Wang 2 and Yuxing Peng 1. College of Computer, National University of Defense Technology, Changsha 410073, China

Introduction
Traditional Learning Rate Methods
Adaptive Gradient Methods
Preliminaries
Specify Bounds for Adam
Schedule Bounds for Adam
Finding Minima
Converging
Uniform Scaling
Algorithm Overview
Experiments
Experimental Setup
Simple Neural Network
Deep Convolutional Network
Language Modeling
Comparison of Different Scheduling Methods
Findings
Conclusions