Abstract

Large-scale distributed training of deep neural networks results in models with worse generalization performance because of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or by ad hoc modifications of batch normalization. We propose scalable and practical natural gradient descent (SP-NGD), a principled approach to training models that allows them to attain generalization performance similar to models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with negligible computational overhead compared to first-order methods. We evaluated SP-NGD on a benchmark task where highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.

Highlights

  • As the size of deep neural network models and of the datasets they are trained on continues to increase rapidly, so does the demand for distributed parallel computing

  • We found that Natural Gradient Descent (NGD) enables training in far fewer steps than was previously thought possible with stochastic gradient descent (SGD), while retaining competitive generalization performance with the help of stronger data augmentation, i.e., mixup [18] (a minimal sketch of mixup is given after this list)

  • We proposed Scalable and Practical Natural Gradient Descent (SP-NGD), a framework that combines (i) a large-scale distributed computational design with hybrid data and model parallelism for Natural Gradient Descent (NGD) [17] and (ii) practical Fisher information estimation techniques, including Kronecker-Factored Approximate Curvature (K-FAC) [19], which alleviate the computational overhead of NGD over SGD (a single-layer K-FAC sketch follows this list)
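
As a point of reference for the mixup augmentation mentioned in the first highlight, below is a minimal NumPy sketch of standard mixup applied to one mini-batch. The function name `mixup_batch` and the `alpha` default are illustrative choices, not settings reported in the paper.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Standard mixup: convex combinations of input pairs and their one-hot labels.

    x: float array of shape (batch, ...) -- input images
    y: float array of shape (batch, num_classes) -- one-hot labels
    alpha: Beta-distribution parameter controlling interpolation strength
           (0.2 is an illustrative default, not the paper's setting)
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))         # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed
```

Each example and its label are interpolated with a randomly paired example from the same batch, acting as the stronger regularizer that, per the highlight, lets NGD train in fewer steps without losing accuracy.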
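
The second highlight names K-FAC as the Fisher information estimation technique. The sketch below is a minimal, single-dense-layer illustration of the Kronecker-factored natural-gradient step, assuming the usual factorization of the layer's Fisher block; the function name, damping, and learning rate are illustrative, and the distributed design, BatchNorm handling, and stale-statistics refresh that SP-NGD adds are omitted.

```python
import numpy as np

def kfac_layer_update(grad_W, acts, pre_act_grads, damping=1e-3, lr=1e-3):
    """One K-FAC-style natural-gradient step for a single dense layer.

    grad_W:        (out, in)   mini-batch gradient of the loss w.r.t. W
    acts:          (batch, in) layer inputs a
    pre_act_grads: (batch, out) back-propagated gradients g w.r.t. pre-activations

    K-FAC approximates the layer's Fisher block as a Kronecker product
    F ~= A (x) G with A = E[a a^T] and G = E[g g^T], so the natural gradient
    F^{-1} vec(grad_W) reduces to the matrix product G^{-1} grad_W A^{-1}.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n                      # (in, in) Kronecker factor
    G = pre_act_grads.T @ pre_act_grads / n    # (out, out) Kronecker factor

    # Tikhonov damping keeps the small factors invertible.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))

    nat_grad = G_inv @ grad_W @ A_inv          # preconditioned gradient
    return -lr * nat_grad                      # weight update delta
```

The key point is that the expensive Fisher inversion collapses into two small per-layer inversions, which is what makes the overhead over SGD manageable in practice.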

Summary

INTRODUCTION

As the size of deep neural network models and of the datasets they are trained on continues to increase rapidly, the demand for distributed parallel computing is also growing. Distributing the work across many processors increases the effective mini-batch size, which tends to degrade generalization. Goyal et al. [3] adopt strategies such as scaling the learning rate proportionally to the mini-batch size, while using the first few epochs to gradually warm up the learning rate. Such methods have enabled training with mini-batch sizes of 8K, where ImageNet [4] with ResNet-50 [5] could be trained for 90 epochs in 60 minutes with little reduction in generalization performance (76.3% top-1 validation accuracy). More complex approaches for manipulating the learning rate were proposed, such as LARS [11], where a different learning rate is used for each layer, obtained by scaling the global rate with the ratio between the layer-wise norms of the weights and the gradients. This enabled training with a mini-batch size of 32K without ad hoc modifications, achieving 74.9% accuracy in 14 minutes (64 epochs) [11].
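
As a rough illustration of this layer-wise scaling, the following sketch rescales one layer's learning rate by the ratio of its weight norm to its gradient norm, in the spirit of LARS; the function name and hyperparameter values are illustrative and not those of the cited 32K-batch experiments.

```python
import numpy as np

def lars_layer_step(w, grad, base_lr=0.1, trust_coef=0.001,
                    weight_decay=1e-4, eps=1e-9):
    """One layer-wise-scaled SGD step in the spirit of LARS.

    The layer's effective learning rate is proportional to ||w|| / ||grad||,
    so layers whose gradients are small relative to their weights take
    proportionally larger steps, and vice versa.
    """
    update = grad + weight_decay * w
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    local_lr = trust_coef * w_norm / (u_norm + eps)   # layer-wise ratio
    return w - base_lr * local_lr * update
```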

RELATED WORK
Mini-batch Stochastic Learning
Natural Gradient Descent in Deep Learning
K-FAC for convolutional layers
PRACTICAL NATURAL GRADIENT
Practical FIM Estimation for BatchNorm Layers
Fast Estimation with Empirical Fisher
Unit-wise Natural Gradient
Distributed Natural Gradient
Adaptive Frequency To Refresh Statistics
Further acceleration
TRAINING FOR IMAGENET CLASSIFICATION
Data augmentation
Weights rescaling
EXPERIMENTS
Experiment Environment
Extremely Large Mini-batch Training
Scalability
Effectiveness of Practical Natural Gradient
Findings
DISCUSSION AND FUTURE WORK