Abstract

Existing adaptive gradient descent optimization algorithms, such as adaptive gradient (Adagrad), adaptive moment estimation (Adam), and root mean square propagation (RMSprop), increase convergence speed by dynamically adjusting the learning rate. However, in some application scenarios the generalization ability of these adaptive gradient descent optimization algorithms is inferior to that of stochastic gradient descent (SGD). To address this problem, several improved algorithms have recently been proposed, including adaptive mean square gradient (AMSGrad) and AdaBound. In this paper, we present new variants of AdaBound and AMSBound, called GWDC (Adam with weighted gradient and dynamic bound of learning rate) and AMSGWDC (AMSGrad with weighted gradient and dynamic bound of learning rate), respectively. The proposed algorithms are built on a dynamic decay rate method that places more weight on recent gradients in the first moment estimation. A theoretical proof of the convergence of the proposed algorithms is also presented. To verify the performance of GWDC and AMSGWDC, we compare them with other popular optimization algorithms on three well-known machine learning models, i.e., a feedforward neural network, a convolutional neural network, and a gated recurrent unit network. Experimental results show that the proposed algorithms achieve better generalization performance than the other optimization algorithms on test data, and they also converge faster.
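To make the description above concrete, the following is a minimal NumPy sketch of one GWDC-style update step, based only on the abstract: an Adam-type update whose first-moment decay rate varies with the step counter (so recent gradients carry more weight) and whose per-coordinate learning rate is clipped between AdaBound-style dynamic bounds that converge toward a final SGD-like rate. The specific schedules beta1_t, lower, and upper below are illustrative assumptions, not the paper's exact formulas.

import numpy as np

def gwdc_step(theta, grad, m, v, t, alpha=1e-3, final_lr=0.1,
              beta1_base=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """One hypothetical GWDC-style update; all array arguments are NumPy arrays."""
    # Assumed dynamic decay rate: beta1_t shrinks as t grows, which puts
    # relatively more memory on the most recent gradients.
    beta1_t = beta1_base / (1.0 + gamma * t)

    m = beta1_t * m + (1.0 - beta1_t) * grad        # weighted first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad     # second moment, as in Adam

    # AdaBound-style dynamic bounds that tighten around final_lr over time.
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))

    step_size = np.clip(alpha / (np.sqrt(v) + eps), lower, upper)
    theta = theta - step_size * m
    return theta, m, v

# Toy usage: a few updates on the quadratic loss f(x) = 0.5 * ||x||^2, grad = x.
x = np.ones(3)
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 101):
    x, m, v = gwdc_step(x, grad=x, m=m, v=v, t=t)
print(x)  # should have moved toward the minimizer at 0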

Highlights

  • In recent years, deep neural networks have been applied in various research fields [1]–[3]

  • In this paper, we propose new variants of AdaBound and AMSBound called GWDC (Adam with weighted gradient and dynamic bound of learning rate) and AMSGWDC (AMSGrad with weighted gradient and dynamic bound of learning rate), respectively, which use a dynamic decay rate in the first moment estimation so that more memory is put on recent gradients than on past gradients

  • Several improved algorithms, such as AdaBound, have been proposed in recent work; they combine the fast early-stage training of adaptive moment estimation (Adam) with the generalization ability of stochastic gradient descent (SGD)


Summary

INTRODUCTION

Deep neural networks have been applied in various research fields [1]–[3]. Compared to non-adaptive optimization algorithms, adaptive optimization algorithms exhibit poor generalization ability. To further address this problem, Reddi et al. [22] proposed a variant of Adam that employs the idea of "long-term memory" of past gradients; AdaBound and AMSBound were proposed in 2019. Based on this analysis, current optimization algorithms are still not ideal in terms of generalization ability and convergence speed. The experimental results show that, compared with several adaptive and non-adaptive gradient descent optimization methods, the proposed methods converge faster and display superior generalization performance.
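For reference, below is a minimal NumPy sketch of the "long-term memory" idea attributed to AMSGrad above: the second-moment estimate used in the step is the running maximum of all past estimates, so the effective per-coordinate learning rate never increases. The hyperparameter defaults are standard Adam-style values, not taken from this paper.

import numpy as np

def amsgrad_step(theta, grad, m, v, v_hat, alpha=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update; all array arguments are NumPy arrays."""
    m = beta1 * m + (1.0 - beta1) * grad           # first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second moment
    v_hat = np.maximum(v_hat, v)                   # long-term memory of past second moments
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat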

RELATED WORK
EXPERIMENTS
CONCLUSION AND FUTURE WORK
AUXILIARY LEMMAS
