Abstract

Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. We propose LoCal+SGD, a new algorithmic approach to accelerate DNN training by selectively combining localized or Hebbian learning within a Stochastic Gradient Descent (SGD) based training framework. Back-propagation is a computationally expensive process that requires 2 Generalized Matrix Multiply (GEMM) operations to compute the error and weight gradients for each layer. We alleviate this by selectively updating some layers' weights using localized learning rules that require only 1 GEMM operation per layer. Further, since localized weight updates are performed during the forward pass itself, the layer activations for such layers do not need to be stored until the backward pass, resulting in a reduced memory footprint. Localized updates can substantially boost training speed, but need to be used judiciously in order to preserve accuracy and convergence. We address this challenge through a Learning Mode Selection Algorithm, which gradually selects and moves layers to localized learning as training progresses. Specifically, for each epoch, the algorithm identifies a Localized→SGD transition layer that delineates the network into two regions. Layers before the transition layer use localized updates, while the transition layer and later layers use gradient-based updates. We propose both static and dynamic approaches to the design of the learning mode selection algorithm. The static algorithm utilizes a pre-defined scheduler function to identify the position of the transition layer, while the dynamic algorithm analyzes the dynamics of the weight updates made to the transition layer to determine how the boundary between SGD and localized updates is shifted in future epochs. We also propose a low-cost weak supervision mechanism that controls the learning rate of localized updates based on the overall training loss. We applied LoCal+SGD to 8 image recognition CNNs (including ResNet50 and MobileNetV2) across 3 datasets (Cifar10, Cifar100, and ImageNet). Our measurements on an Nvidia GTX 1080Ti GPU demonstrate up to 1.5× improvement in end-to-end training time with ~0.5% loss in Top-1 classification accuracy.
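As an illustration of the learning mode selection and weak supervision described above, the sketch below shows a static, linearly advancing choice of the Localized→SGD transition layer and a loss-based scaling of the localized learning rate. The linear schedule, the 0.5 damping factor, and the layer/epoch counts are assumptions made for illustration; the paper's exact scheduler function and its dynamic variant are not reproduced here.

```python
# Sketch of per-epoch learning mode selection with weak supervision.
# Assumptions: a linear static schedule and a simple loss-based damping rule.
import numpy as np

NUM_LAYERS = 50   # illustrative network depth
NUM_EPOCHS = 90   # illustrative training length

def static_transition_layer(epoch):
    """Static scheduler (assumed linear): the Localized->SGD transition layer
    moves deeper into the network as training progresses."""
    return int((epoch / NUM_EPOCHS) * NUM_LAYERS)   # layers [0, t) use localized updates

def weak_supervision_lr(base_lr, prev_loss, curr_loss):
    """Weak supervision (assumed rule): damp the localized learning rate
    whenever the overall training loss stops improving."""
    if prev_loss is None or curr_loss < prev_loss:
        return base_lr
    return 0.5 * base_lr

prev_loss = None
for epoch in range(NUM_EPOCHS):
    t = static_transition_layer(epoch)
    modes = ["localized" if layer < t else "sgd" for layer in range(NUM_LAYERS)]
    # ... train one epoch: layers marked "localized" update during the forward
    #     pass, while the remaining layers are trained with SGD ...
    curr_loss = float(np.random.rand())             # placeholder for the measured training loss
    local_lr = weak_supervision_lr(0.01, prev_loss, curr_loss)
    prev_loss = curr_loss
```

A dynamic variant would replace static_transition_layer with a rule that inspects the weight updates made at the current transition layer before deciding whether to advance the boundary.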

Highlights

  • Deep Neural Networks (DNNs) have achieved continued success in many machine learning tasks involving images (Krizhevsky et al., 2017), videos (Ng et al., 2015), text (Zhou et al., 2015), and natural language (Goldberg and Hirst, 2017)

  • Back-propagation is computationally expensive, accounting for 65–75% of the total training time on GPUs. This is attributed to two key factors: (i) back-propagation (BP) involves 2 Generalized Matrix Multiply (GEMM) operations per layer, one to propagate the error and the other to compute the weight gradients, and (ii) when training on distributed systems using data/model parallelism (Dean et al., 2012; Krizhevsky et al., 2012), aggregation of weight gradients/errors across devices incurs significant communication overhead

  • Across 8 image recognition CNNs and 3 datasets (Cifar10, Cifar100, and ImageNet), we demonstrate that LoCal+Stochastic Gradient Descent (SGD) achieves up to 1.5× improvement in training time with ~0.5% Top-1 accuracy loss on an Nvidia GTX 1080Ti GPU


Summary

Introduction

Deep Neural Networks (DNNs) have achieved continued success in many machine learning tasks involving images (Krizhevsky et al., 2017), videos (Ng et al., 2015), text (Zhou et al., 2015), and natural language (Goldberg and Hirst, 2017). Training state-of-the-art DNN models is highly computationally expensive, often requiring exa-FLOPs of compute, as the models are complex and need to be trained using large datasets. We aim to reduce the computational complexity of DNN training through a new algorithmic approach called LoCal+SGD, which alleviates the key performance bottlenecks in Stochastic Gradient Descent (SGD) through selective use of localized learning. The training inputs, typically grouped into minibatches, are iteratively forward propagated (FP) and back propagated (BP) through the DNN layers to compute weight updates that push the network parameters in the direction that decreases the overall classification loss. Back-propagation dominates this process, accounting for 65–75% of the total training time on GPUs. This is attributed to two key factors: (i) BP involves 2 Generalized Matrix Multiply (GEMM) operations per layer, one to propagate the error and the other to compute the weight gradients, and (ii) when training on distributed systems using data/model parallelism (Dean et al., 2012; Krizhevsky et al., 2012), aggregation of weight gradients/errors across devices incurs significant communication overhead.
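To make the cost asymmetry concrete, the sketch below contrasts the two GEMMs performed by back-propagation for a fully connected layer with a one-GEMM, Hebbian-style localized update applied during the forward pass. The specific localized rule shown is an illustrative assumption, not necessarily the exact update used by LoCal+SGD.

```python
# Per-layer GEMM count: back-propagation (2 GEMMs) vs. a localized update (1 GEMM).
# The Hebbian-style rule below is an assumed example of a localized learning rule.
import numpy as np

batch, n_in, n_out = 32, 512, 256
x = np.random.randn(batch, n_in)          # layer input activations
W = np.random.randn(n_in, n_out) * 0.01   # layer weights
y = x @ W                                 # forward GEMM (common to both schemes)

# --- SGD / back-propagation: two GEMMs per layer ---
grad_y = np.random.randn(batch, n_out)    # error arriving from the next layer
grad_x = grad_y @ W.T                     # GEMM 1: propagate the error backwards
grad_W = x.T @ grad_y                     # GEMM 2: compute the weight gradient

# --- Localized update: one GEMM, performed during the forward pass ---
lr_local = 1e-3
delta_W = x.T @ y                         # single GEMM: input/output correlation
W += lr_local * delta_W / batch           # x need not be stored for a backward pass
```

Because the localized update completes during the forward pass, it avoids both the error-propagation GEMM and the need to retain the input activations until the backward pass, which is the source of the memory-footprint reduction noted in the abstract.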

