Abstract

Deep neural networks achieve excellent performance in many research fields, but many of these models are over-parameterized: computing their weight matrices is time-consuming and demands substantial computing resources. To address these problems, this paper proposes a novel block-based division method and a coarse-grained block pruning strategy to simplify and compress the fully connected structure, and stores the pruned weight matrices, which retain a blocky structure, in the Block Sparse Row (BSR) format to accelerate weight-matrix computation. First, the weight matrices are divided into square sub-blocks based on spatial aggregation. Second, a coarse-grained block pruning procedure is applied to scale down the number of model parameters. Finally, the BSR storage format, which is well suited to storing and computing with block sparse matrices, is used to hold the remaining dense weight blocks and speed up the calculation. Experiments on the MNIST and Fashion-MNIST datasets examine how accuracy varies with different pruning granularities and sparsity levels. The results show that the coarse-grained block pruning method compresses the network and reduces the computational cost without greatly degrading classification accuracy, and an experiment on the CIFAR-10 dataset shows that the block pruning strategy also combines well with convolutional networks.
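
The following is a minimal sketch of the pipeline described above (block division, coarse-grained block pruning, BSR storage), written with NumPy and SciPy; the block size, sparsity level, and layer shape are illustrative assumptions, not settings taken from the paper.

```python
# A minimal sketch, assuming NumPy/SciPy: square sub-blocks with the smallest
# L1 norms are zeroed, and the surviving dense blocks are stored and multiplied
# in SciPy's Block Sparse Row (bsr_matrix) format.
import numpy as np
from scipy.sparse import bsr_matrix

def block_prune(weights, block_size, sparsity):
    """Zero out the square sub-blocks with the smallest L1 norms."""
    rows, cols = weights.shape
    assert rows % block_size == 0 and cols % block_size == 0
    # View the matrix as a grid of (block_size x block_size) sub-blocks.
    grid = weights.reshape(rows // block_size, block_size,
                           cols // block_size, block_size).swapaxes(1, 2)
    norms = np.abs(grid).sum(axis=(2, 3))          # one importance score per block
    k = int(norms.size * sparsity)                 # number of blocks to drop
    threshold = np.sort(norms, axis=None)[k]
    mask = (norms >= threshold)[:, :, None, None]  # keep only blocks above the threshold
    return (grid * mask).swapaxes(1, 2).reshape(rows, cols)

# Prune 80% of the 16x16 blocks of a fully connected layer, then store the
# surviving dense blocks in BSR format for the matrix-vector product.
W = np.random.randn(784, 304)                      # illustrative layer shape
W_pruned = block_prune(W, block_size=16, sparsity=0.8)
W_bsr = bsr_matrix(W_pruned, blocksize=(16, 16))
y = W_bsr.dot(np.random.randn(304))                # block sparse matrix-vector product
```

Because each retained block is stored contiguously, the multiplication operates on dense sub-blocks rather than scattered individual elements, which is what makes BSR a natural fit for block-pruned weights.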

Highlights

  • Deep neural network architectures are becoming more complex, and the number of parameters is increasing sharply [1,2]

  • We explored the computational efficiency when our block pruning method was combined with the Block Sparse Row (BSR) format

  • This study presented a special block-based division method and coarse-grained block pruning method for fully connected structures


Summary

Introduction

Deep neural network architectures are becoming more complex, and the number of parameters is increasing sharply [1,2]. In order to reduce the number of parameters and to accelerate computation, many methods for neural network compression and pruning have been proposed, such as low-rank factorization [3], knowledge distillation [4], and weight sharing and connection pruning [5]. On the other hand, existing coarse-grained pruning methods designed for high computing efficiency are mostly specific to convolutional neural networks (CNNs). To address these limitations, this paper focuses on a coarse-grained pruning method suitable for fully connected structures that removes model redundancy and improves computing efficiency without greatly harming accuracy. In [6], Han et al. sorted the absolute values of the weights in the network and deleted connections below a threshold, reducing the number of parameters of LeNet-300-100 by 12 times.
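
As a point of comparison with the block-based approach, a minimal sketch of that fine-grained magnitude pruning (sort the absolute weight values and delete connections below a threshold) might look as follows; the layer shape and sparsity level are illustrative choices, not figures reported in [6].

```python
# A minimal sketch of fine-grained (connection-level) magnitude pruning:
# individual weights whose absolute value falls below a threshold are removed.
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitudes."""
    threshold = np.sort(np.abs(weights), axis=None)[int(weights.size * sparsity)]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W = np.random.randn(784, 300)                # e.g. the first LeNet-300-100 layer
W_sparse = magnitude_prune(W, sparsity=0.9)  # 90% sparsity is an illustrative choice
```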

Coarse-Grained Pruning
Sparse Matrix Storage and Computational Optimization
Block Pruning Model
Experiments and Results
Block Sparse Matrix Computation and Cache Hit Ratio Experiment
Block-Based Pruning Strategy on Convolutional Network
Conclusions and Discussion