Abstract

The growth in size and complexity of convolutional neural networks (CNNs) is forcing the partitioning of a network across multiple accelerators during training and pipelining of backpropagation computations over these accelerators. Pipelining results in the use of stale weights. Existing approaches to pipelined training avoid or limit the use of stale weights with techniques that either underutilize accelerators or increase training memory footprint. This paper contributes a pipelined backpropagation scheme that uses stale weights to maximize accelerator utilization and keep memory overhead modest. It explores the impact of stale weights on the statistical efficiency and performance using 4 CNNs (LeNet-5, AlexNet, VGG, and ResNet) and shows that when pipelining is introduced in early layers, training with stale weights converges and results in models with comparable inference accuracies to those resulting from nonpipelined training (a drop in accuracy of 0.4%, 4%, 0.83%, and 1.45% for the 4 networks, respectively). However, when pipelining is deeper in the network, inference accuracies drop significantly (up to 12% for VGG and 8.5% for ResNet-20). The paper also contributes a hybrid training scheme that combines pipelined with nonpipelined training to address this drop. The potential for performance improvement of the proposed scheme is demonstrated with a proof-of-concept pipelined backpropagation implementation in PyTorch on 2 GPUs using ResNet-56/110/224/362, achieving speedups of up to 1.8X over a 1-GPU baseline.

Highlights

  • Machine learning (ML), in particular convolutional neural networks (CNNs), has advanced at an exponential rate over the last few years, enabled by the availability of high-performance computing devices and the abundance of data

  • On the one hand, limiting pipelining to the early layers is often not a real restriction, since the early convolutional layers typically account for the bulk of the computation and are the ones that benefit most from pipelining. On the other hand, when pipelining extends deeper into the network, we address the resulting drop in inference accuracy with a hybrid scheme that combines pipelined and nonpipelined training, maintaining accuracy while still delivering performance improvements (see the sketch after this list)

  • We use several CNNs (LeNet-5, AlexNet, VGG, and ResNet) and a number of datasets in our evaluation

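To illustrate the idea behind the hybrid scheme, below is a minimal single-device PyTorch sketch, an illustrative toy rather than the paper's multi-accelerator implementation: for the first few epochs each update is computed from weights that are a few steps old, emulating the staleness introduced by pipelining, after which training switches to standard nonpipelined SGD. The model, the staleness of 2, the switch epoch, and the learning rate are all hypothetical choices.

```python
# Hypothetical sketch of hybrid (pipelined-then-nonpipelined) training.
# Stale-weight updates are simulated on one device; this is NOT the paper's code.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
staleness, switch_epoch = 2, 3   # hypothetical staleness and switch point
stale_copies = []                # queue of past weight snapshots (pipeline delay)

for epoch in range(5):
    pipelined = epoch < switch_epoch   # pipelined phase first, then nonpipelined
    for _ in range(100):
        x = torch.randn(32, 10)
        y = (x.sum(dim=1) > 0).long()  # synthetic 2-class labels
        if pipelined:
            # Snapshot current weights; compute gradients with weights that are
            # `staleness` steps old, emulating the delay of a pipelined backward pass.
            stale_copies.append(copy.deepcopy(model))
            stale_model = stale_copies.pop(0) if len(stale_copies) > staleness else model
            loss = loss_fn(stale_model(x), y)
            grads = torch.autograd.grad(loss, list(stale_model.parameters()))
            with torch.no_grad():
                # Apply the stale gradients to the *current* weights.
                for p, g in zip(model.parameters(), grads):
                    p -= 0.1 * g
        else:
            # Standard nonpipelined step: gradients from up-to-date weights.
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    print(f"epoch {epoch} done ({'pipelined' if pipelined else 'nonpipelined'})")
```

Keeping only a small, fixed number of weight snapshots in the pipelined phase mirrors the modest memory overhead the scheme aims for.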

Summary

Introduction

Machine learning (ML), in particular convolutional neural networks (CNNs), has advanced at an exponential rate over the last few years, enabled by the availability of high-performance computing devices and the abundance of data. The growth in the size and complexity of these networks forces them to be partitioned across multiple accelerators during training, typically by distributing their layers among the available accelerators, as shown in Figure 1 for an example 8-layer network: the layers are grouped into 4 partitions, P0 to P3, and each partition is mapped to one of the 4 accelerators, A0 to A3. Each accelerator is responsible for the computations associated with the layers mapped to it. The nature of the backpropagation algorithm used to train CNNs [10] is that the computations of a layer are performed only after the computations of the preceding layer in the forward pass of the algorithm, and only after the computations of the succeeding layer in the backward pass. Further, the computations for one batch of input data are performed only after the computations of the preceding batch have updated the parameters (i.e., weights) of the network. These dependences underutilize the accelerators, as shown by the space-time diagram in Figure 2; only one accelerator can be active at any given point in time
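To make the partitioning concrete, the following is a minimal PyTorch sketch, in the spirit of the 2-GPU proof of concept mentioned in the abstract, that splits a small CNN into two contiguous stages, each placed on its own accelerator. The layer sizes, split point, and device names are illustrative assumptions; the sketch shows only the model-parallel placement and the point at which activations cross devices, not the pipelining of backpropagation itself.

```python
# Minimal sketch (not the authors' implementation): partition a small CNN
# across two GPUs so that each device owns a contiguous block of layers.
# Assumes two CUDA devices are available; the split point is illustrative.
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Early (convolutional) stage placed on the first accelerator.
        self.stage0 = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        ).to("cuda:0")
        # Later stage (deeper layers + classifier) placed on the second accelerator.
        self.stage1 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 10),
        ).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Activations cross the device boundary between the two partitions here.
        return self.stage1(x.to("cuda:1"))

if __name__ == "__main__":
    model = TwoStageNet()
    out = model(torch.randn(8, 3, 32, 32))
    print(out.shape)  # torch.Size([8, 10])
```

Without pipelining, a forward/backward pass through such a partitioned model keeps only one of the two devices busy at a time, which is exactly the underutilization illustrated by the space-time diagram in Figure 2.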
