Abstract

As the model scale of Deep Neural Networks (DNNs) increases, the memory and computational costs of DNNs become overwhelmingly large. Sparse Deep Neural Networks (SpDNNs) are a promising way to cope with this challenge, using far fewer weights while preserving accuracy. However, the sparse nature of SpDNN models makes them difficult to run efficiently on GPUs. To stimulate technical advances that improve the efficiency of SpDNN inference, the MIT/IEEE/Amazon GraphChallenge introduced the SpDNN Challenge in 2019. In this paper, we present a hybrid compute model to improve the efficiency of Sparse Matrix Multiplications (SpMMs), the core computation of SpDNN inference. First, the given sparse weight matrix is partitioned into many (sparse and dense) submatrices. For sparse submatrices, we leverage compile-time data embedding to compile the sparse data together with their corresponding computations into instructions, which significantly reduces the number of random memory accesses. For dense submatrices, we follow the traditional computing mode, in which the data is fetched from memory, to exploit the GPU's high memory bandwidth. This hybrid compute model effectively balances the memory and instruction bottlenecks and offers more scheduling opportunities to overlap computing operations and memory accesses on the GPU. To determine whether a submatrix should be treated as sparse, we present a cost model that accurately and efficiently estimates its time cost under both the traditional computing mode and the data-embedded computing mode. Once the computing mode of every submatrix is determined, customized code is generated for SpDNN inference. Experimental results on the SpDNN Challenge benchmarks show that our approach achieves up to 197.86 tera-edges per second inference throughput on a single NVIDIA A100 GPU. Compared to the 2021 and 2020 champions, our approach offers up to 6.37x and 89.94x speedups on a single GPU, respectively. We also implement a 16-GPU version, which achieves up to 9.49x and 80.11x speedups over the 16-GPU implementations of the 2021 and 2020 champions, respectively.
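
To make the contrast between the two computing modes concrete, the following CUDA sketch shows, for a single weight row and a toy problem size, how a data-embedded kernel bakes the nonzero values and column indices into the instruction stream as constants, while a traditional kernel fetches them from a CSR-like layout in memory. This is a minimal illustration under assumed names and sizes (embedded_row_kernel, memory_row_kernel, N_INPUT), not the authors' generated code.

```cuda
// Hypothetical sketch: one weight row, two nonzeros (0.5 at column 1, -2.0 at column 6).
#include <cstdio>
#include <cuda_runtime.h>

#define N_INPUT 8   // number of input (batch) columns in the activation tile; toy size

// Data-embedded mode: nonzero values and their positions are compiled into the
// kernel as immediate constants, so no weight data or indices are read at run time.
__global__ void embedded_row_kernel(const float* __restrict__ x, float* y) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;     // one output column per thread
    if (col >= N_INPUT) return;
    float acc = 0.5f * x[1 * N_INPUT + col] + (-2.0f) * x[6 * N_INPUT + col];
    y[col] = acc;
}

// Traditional mode: the same row is stored in a CSR-like layout and its values and
// column indices are fetched from global memory at run time.
__global__ void memory_row_kernel(const float* __restrict__ vals,
                                  const int* __restrict__ cols, int nnz,
                                  const float* __restrict__ x, float* y) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= N_INPUT) return;
    float acc = 0.0f;
    for (int k = 0; k < nnz; ++k)
        acc += vals[k] * x[cols[k] * N_INPUT + col];
    y[col] = acc;
}

int main() {
    const int rows_x = 8;                                // toy activation tile: 8 x N_INPUT
    float hx[rows_x * N_INPUT];
    for (int i = 0; i < rows_x * N_INPUT; ++i) hx[i] = 1.0f;

    float *dx, *dy;
    cudaMalloc(&dx, sizeof(hx));
    cudaMalloc(&dy, N_INPUT * sizeof(float));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);

    // Data-embedded mode: no weight arrays passed at all.
    embedded_row_kernel<<<1, N_INPUT>>>(dx, dy);

    // Traditional mode: the same row as explicit values/indices in memory.
    float hvals[] = {0.5f, -2.0f};
    int   hcols[] = {1, 6};
    float* dvals; int* dcols;
    cudaMalloc(&dvals, sizeof(hvals));
    cudaMalloc(&dcols, sizeof(hcols));
    cudaMemcpy(dvals, hvals, sizeof(hvals), cudaMemcpyHostToDevice);
    cudaMemcpy(dcols, hcols, sizeof(hcols), cudaMemcpyHostToDevice);
    memory_row_kernel<<<1, N_INPUT>>>(dvals, dcols, 2, dx, dy);

    float hy[N_INPUT];
    cudaMemcpy(hy, dy, sizeof(hy), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect -1.5)\n", hy[0]);          // 0.5*1 + (-2.0)*1

    cudaFree(dx); cudaFree(dy); cudaFree(dvals); cudaFree(dcols);
    return 0;
}
```

In the full approach described above, the cost model would decide, per submatrix, which of these two kernel styles the code generator emits.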
