Abstract

First-order methods such as stochastic gradient descent (SGD) have become popular for training deep neural networks (DNNs) because the resulting networks generalize well; however, they require long training times. Second-order methods, which can shorten training, are rarely used because of the high computational cost of obtaining second-order information. Many works therefore approximate the Hessian matrix to reduce this cost, but the approximation can deviate substantially from the true Hessian. In this paper, we exploit the convexity of the loss function with respect to part of the parameters and propose the damped Newton stochastic gradient descent (DN-SGD) and stochastic gradient descent damped Newton (SGD-DN) methods to train DNNs for regression problems with mean square error (MSE) and classification problems with cross-entropy loss (CEL). In contrast to second-order methods that estimate the Hessian matrix of all parameters, our methods compute second-order information exactly only for a small part of the parameters, which greatly reduces the computational cost and makes the learning process converge faster and more accurately than SGD and Adagrad. Several numerical experiments on real datasets were performed to verify the effectiveness of our methods for regression and classification problems.
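
As a rough illustration of the core idea (a sketch under simplifying assumptions, not the authors' implementation), the example below trains a one-hidden-layer regression network with NumPy: the last-layer weights, whose MSE Hessian block is positive semidefinite, take an exact damped Newton step, while the front-layer weights take an ordinary SGD step. The network size, mini-batch size, learning rate, and damping value are illustrative choices; the update order loosely mirrors DN-SGD, and SGD-DN would reverse it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and a one-hidden-layer network: x -> relu(x @ W1) @ w2.
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)
W1 = 0.1 * rng.normal(size=(10, 32))   # front-layer weights: updated by SGD
w2 = np.zeros(32)                      # last-layer weights: damped Newton
lr, lam = 0.05, 1e-2                   # SGD step size and damping strength (illustrative)

for step in range(200):
    idx = rng.choice(256, size=32, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    m = len(idx)

    H = np.maximum(Xb @ W1, 0.0)       # hidden activations (ReLU)
    r = H @ w2 - yb                    # residuals of the MSE loss

    # Gradients from one forward/backward pass on the mini-batch.
    grad_w2 = H.T @ r / m
    grad_W1 = Xb.T @ ((np.outer(r, w2) / m) * (H > 0))

    # Last layer: damped Newton step. For MSE its Hessian block is
    # H^T H / m, which is positive semidefinite, so adding lam * I
    # keeps the system invertible and the step well behaved.
    hess = H.T @ H / m + lam * np.eye(32)
    w2 -= np.linalg.solve(hess, grad_w2)

    # Front layers: plain SGD step.
    W1 -= lr * grad_W1
```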

Highlights

  • First-order methods are popularly used to train deep neural networks (DNNs), such as stochastic gradient descent (SGD) [1] and its variants, which use momentum and acceleration [2] and an adaptive learning rate [3]

  • We propose the damped Newton stochastic gradient descent (DN-SGD) and stochastic gradient descent damped Newton (SGD-DN) algorithms, which update the parameters of the last layer with the variational damped Newton method and the remaining parameters with SGD

  • In the experiments, DN-SGD and SGD-DN converge faster than SGD and Adagrad in terms of both iteration count and wall-clock time, which is consistent with the provided analysis

Summary

Introduction

First-order methods are popularly used to train deep neural networks (DNNs), such as stochastic gradient descent (SGD) [1] and its variants, which use momentum and acceleration [2] and an adaptive learning rate [3]. SGD calculates the gradient on only a small batch instead of the whole training data; the randomness introduced by sampling the small batch can lead to better generalization of the DNNs [4]. Second-order methods can speed up convergence by using curvature information, but it is practically impossible to compute and invert a full Hessian matrix because of the massive number of parameters in DNNs, and the Hessian matrix is not always positive definite [7]. Efforts to overcome this problem include Kronecker-factored approximate curvature [8,9,10] and Hessian-free inexact Newton methods. These approaches lose part of the second-order information by approximating the Hessian matrix, and lose more by adding regularization terms to make the approximation positive definite.
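
To make the positive-definiteness issue concrete, the toy example below (illustrative only, not taken from the paper) uses a 2x2 indefinite Hessian: the raw Newton direction is not a descent direction, while adding a damping term λI, with λ larger than the magnitude of the most negative eigenvalue, restores one.

```python
import numpy as np

# A 2x2 Hessian with one negative eigenvalue (i.e. indefinite) and a gradient.
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
g = np.array([1.0, 1.0])
lam = 2.0                                          # damping strength (illustrative)

newton = np.linalg.solve(H, g)                     # raw Newton direction
damped = np.linalg.solve(H + lam * np.eye(2), g)   # damped Newton direction

# For the update x <- x - d, d is a descent direction only if g @ d > 0.
print(g @ newton)   # -0.5: the raw Newton step locally increases the loss
print(g @ damped)   #  1.25: the damped step decreases it
```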

Our Contributions
Related Work
Feed-Forward Neural Networks
Convexity of Partial Parameters of the Loss Function
Our Innovation
Defect of Methods Approximating the Hessian Matrix
Set λ Precisely
Last Layer Makes Front Layers Converge Better
Algorithm
Regression Problem
Classification Problem
Discussion of Results
Conclusions and Future Research