Abstract

Existing adaptive gradient descent optimization algorithms, such as adaptive gradient (Adagrad), adaptive moment estimation (Adam), and root mean square propagation (RMSprop), increase convergence speed by dynamically adjusting the learning rate. However, in some application scenarios the generalization ability of these adaptive gradient descent optimization algorithms is inferior to that of stochastic gradient descent (SGD). To address this problem, several improved algorithms have recently been proposed, including adaptive mean square gradient (AMSGrad) and AdaBound. In this paper, we present new variants of AdaBound and AMSBound, called GWDC (Adam with weighted gradient and dynamic bound of learning rate) and AMSGWDC (AMSGrad with weighted gradient and dynamic bound of learning rate), respectively. The proposed algorithms are built on a dynamic decay rate method that places more weight on recent gradients in the first moment estimation. A theoretical proof of the convergence of the proposed algorithms is also presented. To verify the performance of GWDC and AMSGWDC, we compare them with other popular optimization algorithms on three well-known machine learning models, i.e., a feedforward neural network, a convolutional neural network, and a gated recurrent unit network. Experimental results show that the proposed algorithms achieve better generalization performance than the other optimization algorithms on test data, and they also converge faster.
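To make the description above concrete, the following is a minimal NumPy sketch of one GWDC-style update step, based only on the abstract: an Adam-type update whose first-moment decay rate varies with the step counter (so recent gradients carry more weight) and whose per-coordinate learning rate is clipped between AdaBound-style dynamic bounds that converge toward a final SGD-like rate. The specific schedules beta1_t, lower, and upper below are illustrative assumptions, not the paper's exact formulas.

import numpy as np

def gwdc_step(theta, grad, m, v, t, alpha=1e-3, final_lr=0.1,
              beta1_base=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """One hypothetical GWDC-style update; all array arguments are NumPy arrays."""
    # Assumed dynamic decay rate: beta1_t shrinks as t grows, which puts
    # relatively more memory on the most recent gradients.
    beta1_t = beta1_base / (1.0 + gamma * t)

    m = beta1_t * m + (1.0 - beta1_t) * grad        # weighted first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad     # second moment, as in Adam

    # AdaBound-style dynamic bounds that tighten around final_lr over time.
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))

    step_size = np.clip(alpha / (np.sqrt(v) + eps), lower, upper)
    theta = theta - step_size * m
    return theta, m, v

# Toy usage: a few updates on the quadratic loss f(x) = 0.5 * ||x||^2, grad = x.
x = np.ones(3)
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 101):
    x, m, v = gwdc_step(x, grad=x, m=m, v=v, t=t)
print(x)  # should have moved toward the minimizer at 0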

Highlights

  • In recent years, deep neural networks have been applied in various research fields [1]–[3]

  • In this paper, we propose new variants of AdaBound and AMSBound called GWDC (Adam with weighted gradient and dynamic bound of learning rate) and AMSGWDC (AMSGrad with weighted gradient and dynamic bound of learning rate), respectively, which use a dynamic decay rate in the first moment estimation so that more memory is put on recent gradients than on past gradients

  • Several improved algorithms, such as AdaBound, have been proposed in recent work; they combine the fast early-stage training of adaptive moment estimation (Adam) with the generalization ability of stochastic gradient descent (SGD)


Summary

INTRODUCTION

Deep neural networks have been applied in various research fields [1]–[3]. Compared to non-adaptive optimization algorithms, adaptive optimization algorithms exhibit poor generalization ability. To further address this problem, Reddi et al. [22] proposed a variant of Adam that employs the idea of "long-term memory" of past gradients; AdaBound and AMSBound were proposed in 2019. Based on this analysis, current optimization algorithms are still not ideal in terms of generalization ability and convergence speed. The experimental results show that, compared with several adaptive and non-adaptive gradient descent optimization methods, the proposed methods converge faster and display superior generalization performance.
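For reference, below is a minimal NumPy sketch of the "long-term memory" idea attributed to AMSGrad above: the second-moment estimate used in the step is the running maximum of all past estimates, so the effective per-coordinate learning rate never increases. The hyperparameter defaults are standard Adam-style values, not taken from this paper.

import numpy as np

def amsgrad_step(theta, grad, m, v, v_hat, alpha=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update; all array arguments are NumPy arrays."""
    m = beta1 * m + (1.0 - beta1) * grad           # first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second moment
    v_hat = np.maximum(v_hat, v)                   # long-term memory of past second moments
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat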

RELATED WORK
EXPERIMENTS
CONCLUSION AND FUTURE WORK
AUXILIARY LEMMAS
