Abstract

This work presents the UnSparse-Opt framework for efficient unstructured pruning and quantisation of feedforward neural networks and for improving their efficiency on graphics processing units (GPUs) by using a direct sparse algorithm. The NVIDIA deep neural network library (cuDNN) is the most effective implementation of deep learning (DL) algorithms for GPUs. Among the most common techniques for improving the efficiency of convolutional neural network (CNN) models are weight pruning and quantisation. There are two main types of pruning: structured and unstructured. The former enables much easier acceleration on many types of accelerators, but it is difficult to achieve sparsity levels and accuracy as high as those obtained with unstructured pruning. Unstructured pruning with retraining can produce weight tensors with ∼90% or higher sparsity in some deep CNN models. This article presents a pruning algorithm that achieves high sparsity levels without a drop in accuracy. In the next stage, linear and non-linear quantisation are applied for further reductions in run time and memory footprint. Additionally, this work presents real CNN models pruned to high sparsity in which a subset of layers achieves efficiency comparable to or better than cuDNN by using a direct sparse method. Finally, it shows sparse, reduced-precision CNN architectures that can be more efficient than the cuDNN library.
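To make the two steps named in the abstract concrete, below is a minimal sketch (not the paper's UnSparse-Opt implementation) of magnitude-based unstructured pruning of a weight tensor followed by linear (uniform) quantisation of the surviving weights. The sparsity target, bit width, and function names are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: unstructured magnitude pruning + linear quantisation.
# The actual UnSparse-Opt algorithm and retraining schedule may differ.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights so roughly `sparsity` of them become zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

def linear_quantise(weights: np.ndarray, bits: int = 8):
    """Uniform (linear) quantisation of the weights to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax if np.any(weights) else 1.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale  # dequantise as q * scale

# Example: prune a dense layer's weights to ~90% sparsity, then quantise to 8 bits.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.9)
q, scale = linear_quantise(w_sparse, bits=8)
print("sparsity:", 1.0 - np.count_nonzero(w_sparse) / w_sparse.size)
```

For the direct sparse GPU execution the abstract refers to, the remaining non-zero weights would typically be stored in a compressed format (e.g. CSR) before being consumed by sparse kernels; that part is omitted here.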
