Abstract

The goal of this article is to accelerate useful adaptive learning rate optimization algorithms, such as AdaGrad, RMSProp, Adam, and AMSGrad, for training deep neural networks. To reach this goal, we devise an iterative algorithm combining the existing adaptive learning rate optimization algorithms with conjugate gradient-like methods, which are useful for constrained optimization. Convergence analyses show that the proposed algorithm with a small constant learning rate approximates a stationary point of a nonconvex optimization problem in deep learning. Furthermore, it is shown that the proposed algorithm with diminishing learning rates converges to a stationary point of the nonconvex optimization problem. The convergence and performance of the algorithm are demonstrated through numerical comparisons with the existing adaptive learning rate optimization algorithms for image and text classification. The numerical results show that the proposed algorithm with a constant learning rate is superior for training neural networks.
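
As a concrete illustration of how a conjugate gradient-like direction can be combined with an adaptive learning rate method, the sketch below feeds such a direction into Adam-style moment estimates. This is only a minimal sketch under stated assumptions: the fixed coefficient `gamma`, the function name `cg_like_adam_step`, and the omission of bias correction are illustrative choices and may not match the algorithm proposed in the article.

```python
import numpy as np

def cg_like_adam_step(theta, grad, state, alpha=1e-3, beta1=0.9,
                      beta2=0.999, gamma=0.9, eps=1e-8):
    """One illustrative update: a conjugate gradient-like search direction
    (current gradient plus a scaled previous direction) is passed through
    Adam-style first- and second-moment estimates.

    Hypothetical sketch: `gamma` is a fixed coefficient chosen for
    illustration; the article's coefficient rule may differ.
    """
    zeros = np.zeros_like(grad)

    # Conjugate gradient-like direction: current gradient plus a scaled
    # copy of the previous direction.
    d = grad + gamma * state.get("d", zeros)

    # Adam-style exponential moving averages of the direction and its square.
    m = beta1 * state.get("m", zeros) + (1 - beta1) * d
    v = beta2 * state.get("v", zeros) + (1 - beta2) * d ** 2

    # Parameter update with a constant learning rate alpha.
    theta = theta - alpha * m / (np.sqrt(v) + eps)

    state.update(d=d, m=m, v=v)
    return theta, state
```

A call such as `theta, state = cg_like_adam_step(theta, grad, {})` performs one update with an empty state on the first iteration; the same state dictionary is passed back in on subsequent iterations.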

Highlights

  • Deep neural networks are used for many tasks, such as natural language processing, computer vision, and text and image classification, and a number of algorithms have been presented to tune the model parameters of such networks. The appropriate parameters are found by solving nonconvex stochastic optimization problems. In particular, the algorithms solve these problems in order to adapt the learning rates of the model parameters

  • We propose an iterative algorithm with conjugate gradient-like directions for nonconvex optimization in deep neural networks to accelerate conventional adaptive learning rate optimization algorithms

Summary

Introduction

The algorithms solve these nonconvex stochastic optimization problems in order to adapt the learning rates of the model parameters. The convergence analyses indicated that the algorithms with sufficiently small constant learning rates approximate stationary points of the problems [8] (Theorem 3.1). This implies that useful algorithms, such as Adam and AMSGrad, can use constant learning rates to solve the nonconvex stochastic optimization problems in deep learning, in contrast to the results in [6,7], which provided analyses only under convexity assumptions on the objective functions and with diminishing learning rates. Numerical comparisons showed that the algorithms with constant learning rates perform better than those with diminishing learning rates.
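
To make the comparison between the two step-size regimes tangible, the following is a minimal sketch of a constant versus a diminishing learning rate schedule; the alpha / sqrt(n + 1) decay is an assumed example and may not be the exact schedule analyzed in the article.

```python
def constant_lr(alpha=1e-3):
    """Constant learning rate: the same alpha at every iteration n."""
    return lambda n: alpha

def diminishing_lr(alpha=1e-3):
    """Diminishing learning rate; the alpha / sqrt(n + 1) decay is only an
    illustrative choice, not necessarily the schedule used in the article."""
    return lambda n: alpha / (n + 1) ** 0.5
```

Either schedule can be plugged into an optimizer step by evaluating it at the current iteration, e.g. `alpha_n = diminishing_lr()(n)`.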
