Abstract

Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. We propose LoCal+SGD, a new algorithmic approach to accelerate DNN training by selectively combining localized or Hebbian learning within a Stochastic Gradient Descent (SGD) based training framework. Back-propagation is a computationally expensive process that requires 2 Generalized Matrix Multiply (GEMM) operations to compute the error and weight gradients for each layer. We alleviate this by selectively updating some layers' weights using localized learning rules that require only 1 GEMM operation per layer. Further, since localized weight updates are performed during the forward pass itself, the layer activations for such layers do not need to be stored until the backward pass, resulting in a reduced memory footprint. Localized updates can substantially boost training speed, but need to be used judiciously in order to preserve accuracy and convergence. We address this challenge through a Learning Mode Selection Algorithm, which gradually selects and moves layers to localized learning as training progresses. Specifically, for each epoch, the algorithm identifies a Localized→SGD transition layer that delineates the network into two regions. Layers before the transition layer use localized updates, while the transition layer and later layers use gradient-based updates. We propose both static and dynamic approaches to the design of the learning mode selection algorithm. The static algorithm utilizes a pre-defined scheduler function to identify the position of the transition layer, while the dynamic algorithm analyzes the dynamics of the weight updates made to the transition layer to determine how the boundary between SGD and localized updates is shifted in future epochs. We also propose a low-cost weak supervision mechanism that controls the learning rate of localized updates based on the overall training loss. We applied LoCal+SGD to 8 image recognition CNNs (including ResNet50 and MobileNetV2) across 3 datasets (Cifar10, Cifar100, and ImageNet). Our measurements on an Nvidia GTX 1080Ti GPU demonstrate up to 1.5× improvement in end-to-end training time with ~0.5% loss in Top-1 classification accuracy.
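As an illustration of the learning mode selection and weak supervision described above, the sketch below shows a static, linearly advancing choice of the Localized→SGD transition layer and a loss-based scaling of the localized learning rate. The linear schedule, the 0.5 damping factor, and the layer/epoch counts are assumptions made for illustration; the paper's exact scheduler function and its dynamic variant are not reproduced here.

```python
# Sketch of per-epoch learning mode selection with weak supervision.
# Assumptions: a linear static schedule and a simple loss-based damping rule.
import numpy as np

NUM_LAYERS = 50   # illustrative network depth
NUM_EPOCHS = 90   # illustrative training length

def static_transition_layer(epoch):
    """Static scheduler (assumed linear): the Localized->SGD transition layer
    moves deeper into the network as training progresses."""
    return int((epoch / NUM_EPOCHS) * NUM_LAYERS)   # layers [0, t) use localized updates

def weak_supervision_lr(base_lr, prev_loss, curr_loss):
    """Weak supervision (assumed rule): damp the localized learning rate
    whenever the overall training loss stops improving."""
    if prev_loss is None or curr_loss < prev_loss:
        return base_lr
    return 0.5 * base_lr

prev_loss = None
for epoch in range(NUM_EPOCHS):
    t = static_transition_layer(epoch)
    modes = ["localized" if layer < t else "sgd" for layer in range(NUM_LAYERS)]
    # ... train one epoch: layers marked "localized" update during the forward
    #     pass, while the remaining layers are trained with SGD ...
    curr_loss = float(np.random.rand())             # placeholder for the measured training loss
    local_lr = weak_supervision_lr(0.01, prev_loss, curr_loss)
    prev_loss = curr_loss
```

A dynamic variant would replace static_transition_layer with a rule that inspects the weight updates made at the current transition layer before deciding whether to advance the boundary.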

Highlights

  • Deep Neural Networks (DNNs) have achieved continued success in many machine learning tasks involving images (Krizhevsky et al., 2017), videos (Ng et al., 2015), text (Zhou et al., 2015), and natural language (Goldberg and Hirst, 2017)

  • Back-propagation is computationally expensive, accounting for 65–75% of the total training time on GPUs. This is attributed to two key factors: (i) back-propagation (BP) involves 2 Generalized Matrix Multiply (GEMM) operations per layer, one to propagate the error and the other to compute the weight gradients, and (ii) when training on distributed systems using data/model parallelism (Dean et al., 2012; Krizhevsky et al., 2012), aggregation of weight gradients/errors across devices incurs significant communication overhead

  • Across 8 image recognition CNNs and 3 datasets (Cifar10, Cifar100, and ImageNet), we demonstrate that LoCal+Stochastic Gradient Descent (SGD) achieves up to 1.5× improvement in training time with ~0.5% Top-1 accuracy loss on an Nvidia GTX 1080Ti GPU


Summary

Introduction

Deep Neural Networks (DNNs) have achieved continued success in many machine learning tasks involving images (Krizhevsky et al., 2017), videos (Ng et al., 2015), text (Zhou et al., 2015), and natural language (Goldberg and Hirst, 2017). Training state-of-the-art DNN models is highly computationally expensive, often requiring exa-FLOPs of compute, as the models are complex and need to be trained using large datasets. We aim to reduce the computational complexity of DNN training through a new algorithmic approach called LoCal+SGD, which alleviates the key performance bottlenecks in Stochastic Gradient Descent (SGD) through selective use of localized learning. The training inputs, typically grouped into minibatches, are iteratively forward propagated (FP) and back propagated (BP) through the DNN layers to compute weight updates that push the network parameters in the direction that decreases the overall classification loss. Back-propagation dominates this process, accounting for 65–75% of the total training time on GPUs. This is attributed to two key factors: (i) BP involves 2 Generalized Matrix Multiply (GEMM) operations per layer, one to propagate the error and the other to compute the weight gradients, and (ii) when training on distributed systems using data/model parallelism (Dean et al., 2012; Krizhevsky et al., 2012), aggregation of weight gradients/errors across devices incurs significant communication overhead.
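To make the cost asymmetry concrete, the sketch below contrasts the two GEMMs performed by back-propagation for a fully connected layer with a one-GEMM, Hebbian-style localized update applied during the forward pass. The specific localized rule shown is an illustrative assumption, not necessarily the exact update used by LoCal+SGD.

```python
# Per-layer GEMM count: back-propagation (2 GEMMs) vs. a localized update (1 GEMM).
# The Hebbian-style rule below is an assumed example of a localized learning rule.
import numpy as np

batch, n_in, n_out = 32, 512, 256
x = np.random.randn(batch, n_in)          # layer input activations
W = np.random.randn(n_in, n_out) * 0.01   # layer weights
y = x @ W                                 # forward GEMM (common to both schemes)

# --- SGD / back-propagation: two GEMMs per layer ---
grad_y = np.random.randn(batch, n_out)    # error arriving from the next layer
grad_x = grad_y @ W.T                     # GEMM 1: propagate the error backwards
grad_W = x.T @ grad_y                     # GEMM 2: compute the weight gradient

# --- Localized update: one GEMM, performed during the forward pass ---
lr_local = 1e-3
delta_W = x.T @ y                         # single GEMM: input/output correlation
W += lr_local * delta_W / batch           # x need not be stored for a backward pass
```

Because the localized update completes during the forward pass, it avoids both the error-propagation GEMM and the need to retain the input activations until the backward pass, which is the source of the memory-footprint reduction noted in the abstract.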

