Abstract

Large-scale distributed training of deep neural networks results in models with worse generalization performance because of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or by ad hoc modifications of batch normalization. We propose scalable and practical natural gradient descent (SP-NGD), a principled approach to training models that allows them to attain generalization performance similar to models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with negligible computational overhead compared to first-order methods. We evaluated SP-NGD on a benchmark task where highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.

Highlights

  • As the size of deep neural network models and of the datasets they are trained on continues to increase rapidly, so does the demand for distributed parallel computing

  • We found that Natural Gradient Descent (NGD) enables training in far fewer steps than was previously thought possible with stochastic gradient descent (SGD), while retaining competitive generalization performance with the help of stronger data augmentation, i.e., mixup [18] (a minimal sketch of mixup is given after this list)

  • We proposed Scalable and Practical Natural Gradient Descent (SP-NGD), a framework that combines (i) a large-scale distributed computational design with hybrid data and model parallelism for Natural Gradient Descent (NGD) [17] and (ii) practical Fisher information estimation techniques, including Kronecker-Factored Approximate Curvature (K-FAC) [19], which alleviate the computational overhead of NGD over SGD (a single-layer K-FAC sketch follows this list)
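
As a point of reference for the mixup augmentation mentioned in the first highlight, below is a minimal NumPy sketch of standard mixup applied to one mini-batch. The function name `mixup_batch` and the `alpha` default are illustrative choices, not settings reported in the paper.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Standard mixup: convex combinations of input pairs and their one-hot labels.

    x: float array of shape (batch, ...) -- input images
    y: float array of shape (batch, num_classes) -- one-hot labels
    alpha: Beta-distribution parameter controlling interpolation strength
           (0.2 is an illustrative default, not the paper's setting)
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))         # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed
```

Each example and its label are interpolated with a randomly paired example from the same batch, acting as the stronger regularizer that, per the highlight, lets NGD train in fewer steps without losing accuracy.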
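
The second highlight names K-FAC as the Fisher information estimation technique. The sketch below is a minimal, single-dense-layer illustration of the Kronecker-factored natural-gradient step, assuming the usual factorization of the layer's Fisher block; the function name, damping, and learning rate are illustrative, and the distributed design, BatchNorm handling, and stale-statistics refresh that SP-NGD adds are omitted.

```python
import numpy as np

def kfac_layer_update(grad_W, acts, pre_act_grads, damping=1e-3, lr=1e-3):
    """One K-FAC-style natural-gradient step for a single dense layer.

    grad_W:        (out, in)   mini-batch gradient of the loss w.r.t. W
    acts:          (batch, in) layer inputs a
    pre_act_grads: (batch, out) back-propagated gradients g w.r.t. pre-activations

    K-FAC approximates the layer's Fisher block as a Kronecker product
    F ~= A (x) G with A = E[a a^T] and G = E[g g^T], so the natural gradient
    F^{-1} vec(grad_W) reduces to the matrix product G^{-1} grad_W A^{-1}.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n                      # (in, in) Kronecker factor
    G = pre_act_grads.T @ pre_act_grads / n    # (out, out) Kronecker factor

    # Tikhonov damping keeps the small factors invertible.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))

    nat_grad = G_inv @ grad_W @ A_inv          # preconditioned gradient
    return -lr * nat_grad                      # weight update delta
```

The key point is that the expensive Fisher inversion collapses into two small per-layer inversions, which is what makes the overhead over SGD manageable in practice.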

Summary

INTRODUCTION

As the size of deep neural network models and of the datasets they are trained on continues to increase rapidly, the demand for distributed parallel computing is also growing. Distributing the work across many processors increases the effective mini-batch size, which tends to degrade generalization. Goyal et al. [3] adopt strategies such as scaling the learning rate proportionally to the mini-batch size, while using the first few epochs to gradually warm up the learning rate. Such methods have enabled training with mini-batch sizes of 8K, where ImageNet [4] with ResNet-50 [5] could be trained for 90 epochs in 60 minutes with little reduction in generalization performance (76.3% top-1 validation accuracy). More complex approaches for manipulating the learning rate were proposed, such as LARS [11], where a different learning rate is used for each layer, obtained by scaling the global rate with the ratio between the layer-wise norms of the weights and the gradients. This enabled training with a mini-batch size of 32K without ad hoc modifications, achieving 74.9% accuracy in 14 minutes (64 epochs) [11].
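
As a rough illustration of this layer-wise scaling, the following sketch rescales one layer's learning rate by the ratio of its weight norm to its gradient norm, in the spirit of LARS; the function name and hyperparameter values are illustrative and not those of the cited 32K-batch experiments.

```python
import numpy as np

def lars_layer_step(w, grad, base_lr=0.1, trust_coef=0.001,
                    weight_decay=1e-4, eps=1e-9):
    """One layer-wise-scaled SGD step in the spirit of LARS.

    The layer's effective learning rate is proportional to ||w|| / ||grad||,
    so layers whose gradients are small relative to their weights take
    proportionally larger steps, and vice versa.
    """
    update = grad + weight_decay * w
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    local_lr = trust_coef * w_norm / (u_norm + eps)   # layer-wise ratio
    return w - base_lr * local_lr * update
```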

RELATED WORK
Mini-batch Stochastic Learning
Natural Gradient Descent in Deep Learning
K-FAC for convolutional layers
PRACTICAL NATURAL GRADIENT
Practical FIM Estimation for BatchNorm Layers
Fast Estimation with Empirical Fisher
Unit-wise Natural Gradient
Distributed Natural Gradient
Adaptive Frequency To Refresh Statistics
Further acceleration
TRAINING FOR IMAGENET CLASSIFICATION
Data augmentation
Weights rescaling
EXPERIMENTS
Experiment Environment
Extremely Large Mini-batch Training
Scalability
Effectiveness of Practical Natural Gradient
Findings
DISCUSSION AND FUTURE WORK