Abstract

Many adaptive gradient methods, such as Adagrad, Adadelta, RMSprop and Adam, have been successfully applied to train deep neural networks. These methods perform local optimization with an element-wise scaling of the learning rate based on past gradients. Although they can achieve a favorable training loss, several researchers have pointed out that their generalization capability tends to be poor compared with stochastic gradient descent (SGD) in many applications. These methods achieve a rapid initial decrease in training loss but fail to converge to an optimal solution because of unstable and extreme learning rates. In this paper, we investigate adaptive gradient methods and gain insights into the factors that may lead to the poor performance of Adam. To overcome them, we propose a bounded scheduling algorithm for Adam, which not only improves the generalization capability but also ensures convergence. To validate our claims, we carry out a series of experiments on image classification and language modeling tasks, using standard architectures such as ResNet, DenseNet, SENet and LSTM on typical datasets such as CIFAR-10, CIFAR-100 and Penn Treebank. Experimental results show that our method can eliminate the generalization gap between Adam and SGD while maintaining a relatively high convergence rate during training.
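The abstract refers to Adam's element-wise scaling of the learning rate and to a bounded scheduling scheme that constrains it. Below is a minimal sketch of that idea in NumPy, not the authors' exact algorithm: the function name bounded_adam_step, the hyperparameter values, and the particular bound schedule (bounds that tighten toward a single final_lr as training progresses) are illustrative assumptions.

```python
# Sketch of an Adam-style update whose element-wise step sizes are clipped into a
# scheduled band, in the spirit of the bounded scheduling idea described above.
# The schedule and hyperparameters are hypothetical, for illustration only.
import numpy as np

def bounded_adam_step(param, grad, m, v, t,
                      alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                      final_lr=0.1, gamma=1e-3):
    """One update of an Adam-like optimizer with clipped per-element step sizes."""
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Unclipped element-wise learning rate, as in plain Adam.
    step_size = alpha / (np.sqrt(v_hat) + eps)

    # Scheduled lower/upper bounds that tighten toward final_lr as t grows,
    # suppressing extreme per-element rates so the update behaves more like
    # SGD late in training (hypothetical schedule).
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    step_size = np.clip(step_size, lower, upper)

    param = param - step_size * m_hat
    return param, m, v
```

A training loop would call such a step once per parameter tensor, carrying m, v and the step counter t across iterations.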

Highlights

  • Deep neural networks (DNNs) [1] have achieved great success in many applications, such as image recognition [2], object detection [3], speech recognition [4,5], face recognition [6] and machine translation [7]

  • SGD with momentum (SGDM) has the slowest convergence speed on both the training and test sets, but its final test accuracy is higher than that of Adam and Adagrad, which indicates that its generalization capability is better than that of the adaptive gradient methods

Summary

Gradient Methods

Mingxing Tang 1, Zhen Huang 1,*, Yuan Yuan 2, Changjian Wang 2 and Yuxing Peng 1. College of Computer, National University of Defense Technology, Changsha 410073, China

Introduction
Traditional Learning Rate Methods
Adaptive Gradient Methods
Preliminaries
Specify Bounds for Adam
Schedule Bounds for Adam
Finding Minima
Converging
Uniform Scaling
Algorithm Overview
Experiments
Experimental Setup
Simple Neural Network
Deep Convolutional Network
Language Modeling
Comparison of Different Scheduling Methods
Findings
Conclusions