Abstract

Several recent works have claimed record times for ImageNet training. These records are achieved by using large batch sizes during training, leveraging parallel resources to reduce wall-clock time per training epoch. However, these solutions often require massive hyper-parameter tuning, a significant cost that is frequently ignored. In this work, we perform an extensive analysis of large-batch training for two popular methods: Stochastic Gradient Descent (SGD) and the Kronecker-Factored Approximate Curvature (K-FAC) method. We evaluate the performance of these methods in terms of both wall-clock time and aggregate computational cost, and we study their hyper-parameter sensitivity by performing more than 512 experiments per batch size for each method. We run experiments on multiple models across two datasets, CIFAR-10 and SVHN. The results show that beyond a critical batch size both K-FAC and SGD significantly deviate from ideal strong-scaling behaviour, and that, contrary to common belief, K-FAC does not exhibit improved large-batch scalability compared to SGD.
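
For context (these update rules are standard background, not part of the abstract itself), the two optimizers differ in how each step uses the stochastic gradient: SGD applies it directly, while K-FAC preconditions it with a layer-wise Kronecker-factored approximation of the Fisher information matrix. A minimal sketch in standard notation, with learning rate \eta, damping \lambda, and per-layer Kronecker factors A (from input activations) and G (from output gradients):

    % SGD: step along the negative stochastic gradient
    \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)
    % K-FAC: precondition with the damped Kronecker-factored Fisher approximation F \approx A \otimes G
    \theta_{t+1} = \theta_t - \eta \, (A \otimes G + \lambda I)^{-1} \nabla_\theta \mathcal{L}(\theta_t)

Forming and inverting the Kronecker factors adds per-iteration work beyond SGD, which is one reason aggregate computational cost is reported alongside wall-clock time.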
