Abstract

A large portion of the computation in sparse neural networks consists of multiplying a sparse matrix with a dense matrix, denoted SDMM in this paper. The SDMM operation with an unstructured sparsity pattern cannot be processed efficiently on modern architectures such as GPUs due to irregular compute and memory accesses. However, efficient parallel GPU algorithms for SDMM can be designed when the sparsity pattern is more structured. Thus, the run-time performance of sparse neural networks on a GPU depends on the sparsity pattern of the underlying matrix. In sparse neural networks obtained through pruning-based approaches, the choice of sparsity pattern affects not only the run time, but also the accuracy of the task for which the network is trained. Sparsity patterns with good run-time performance on a GPU may not yield good accuracy, and vice versa. The real challenge, then, is to identify, for a given target architecture, a sparsity pattern, a storage format, and an SDMM algorithm that together lead to sparse neural networks which are efficient in both run time and accuracy. In this work, we propose a novel, structured, flexible, and generic sparsity pattern called the RMB (Regularized Multi Block) sparsity pattern, an efficient storage format (CRMB), and a fast GPU algorithm for RMBMM (SDMM in which the multiplicand has the RMB sparsity pattern). Using the RMB sparsity pattern, we achieve better trade-offs between accuracy and run-time performance of sparse neural networks on a GPU compared to commonly used sparsity patterns such as unstructured and block sparsity.
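
The SDMM operation referred to above can be illustrated with a minimal sketch. The example below uses a generic CSR sparse matrix from SciPy purely to show the shape of the computation; it is not the CRMB storage format or the RMB-specific GPU kernel proposed in the paper, and the dimensions and density are assumed for illustration only.

```python
# Minimal sketch of SDMM: a sparse weight matrix multiplied by a dense matrix.
# Uses SciPy's generic CSR format for illustration (not the paper's CRMB format).
import numpy as np
import scipy.sparse as sp

m, k, n = 1024, 1024, 256   # example dimensions (assumed)
density = 0.05              # ~95% of weights pruned away (assumed)

A = sp.random(m, k, density=density, format="csr", dtype=np.float32)  # sparse weights
X = np.random.rand(k, n).astype(np.float32)                           # dense activations

Y = A @ X                   # SDMM: sparse times dense, producing a dense (m, n) result
assert Y.shape == (m, n)
```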
