Abstract

Training a Deep Neural Network (DNN) is a demanding computing task, since it places high demands on computing resources and memory bandwidth. Many approaches have been proposed to compress a network while maintaining high model accuracy, reducing the computational demands associated with large-scale DNN training. One attractive approach is to leverage Block Circulant Matrices (BCMs) to compress the linear transformation layers, e.g., convolutional and fully-connected layers, which heavily rely on performing General Matrix Multiplications (GEMMs). By using BCMs, we can reduce the weight storage for a linear transformation layer from O(N²) to O(N). BCMs are also more efficient in terms of computational complexity, improving algorithmic complexity from O(N²) to O(N log N). Previous work has only evaluated DNNs using BCMs targeting FPGAs for inference; there has been little prior work that considers the potential benefits of using BCMs for accelerating DNN training on GPUs. In this paper, we explore the acceleration of DNN training using BCMs on a state-of-the-art GPU. First, we identify the challenges posed by using BCMs. Next, we perform both general and GPU-specific optimizations that impact: (i) the decomposition and interaction of individual operations, and (ii) the overall GPU kernel design. We modify the algorithmic steps to remove redundant computations while maintaining mathematical integrity. We also leverage multiple GPU kernel optimizations, considering performance factors such as occupancy, data sharing/reuse patterns, and memory coalescing. We evaluate the performance of DNN training on an NVIDIA Tesla V100, providing insights into the benefits of our proposed kernel optimizations on a state-of-the-art GPU. Based on our results, we achieve average speedups of 1.31× and 2.79× for the convolutional and fully-connected layers of AlexNet, respectively, and average speedups of 1.33× and 3.66× for the convolutional and fully-connected layers of VGGNet-16.
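
To illustrate the complexity reduction the abstract describes, below is a minimal sketch (not the authors' GPU kernels) of a block-circulant linear layer computed via the FFT. Each k × k block of the weight matrix is circulant, so it is defined by a single length-k vector and multiplies its input in O(k log k) rather than O(k²); the function names and block partitioning here are illustrative assumptions.

```python
import numpy as np

def circulant_matvec_fft(w, x):
    """Multiply the circulant matrix whose first column is w by x,
    using the identity circ(w) @ x = IFFT(FFT(w) * FFT(x))."""
    return np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)))

def block_circulant_layer(W_blocks, x, k):
    """W_blocks: (p, q, k) array holding the defining vector of each
    k x k circulant block of a (p*k) x (q*k) weight matrix, so only
    O(N) weights are stored. x: input of length q*k."""
    p, q, _ = W_blocks.shape
    x_blocks = x.reshape(q, k)
    y = np.zeros((p, k))
    for i in range(p):
        for j in range(q):
            # Each block-vector product costs O(k log k) via the FFT.
            y[i] += circulant_matvec_fft(W_blocks[i, j], x_blocks[j])
    return y.reshape(p * k)

# Quick check against the dense O(N^2) GEMM-style computation.
rng = np.random.default_rng(0)
p, q, k = 2, 3, 8
W_blocks = rng.standard_normal((p, q, k))
x = rng.standard_normal(q * k)

# Expand the circulant blocks into the equivalent dense matrix.
dense = np.block([[np.column_stack([np.roll(W_blocks[i, j], s)
                                    for s in range(k)])
                   for j in range(q)] for i in range(p)])
assert np.allclose(block_circulant_layer(W_blocks, x, k), dense @ x)
```

The paper's contribution lies in mapping this FFT-based computation efficiently onto GPU kernels (occupancy, data reuse, memory coalescing); the sketch above only captures the underlying mathematics.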
