Abstract

Neural network models based on LSTM have been widely used in automatic speech recognition (ASR) and natural language processing (NLP). However, achieving higher accuracy requires continually increasing computational complexity and memory footprint. To speed up LSTM inference, previous works have proposed compressing the model with weight pruning. Row-Balanced Sparsity (RBS) is among the most effective of these methods because it balances the distribution of nonzero elements with negligible precision degradation. However, its large index overhead greatly limits the achievable compression ratio. This paper presents a Share Index Row-Balanced Sparsity (SIRBS) compression method, which shares indices among the rows of a row cluster. In this way, the nonzero weight distribution is balanced at a very small index cost, and the decoding cost can be eliminated. Compared with RBS, our index overhead is reduced by 2x-8x. The performance and accuracy of our method are evaluated on the CPU using the PyTorch-Kaldi toolkit. At a sparsity rate of 90%, compared with dense LSTM, our compression ratio and speedup are 6.7x-8.9x and 7.0x-9.9x respectively, and the accuracy decreases by only 0.2%-1.4%.
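The abstract does not give implementation details, so the following is only a minimal sketch of how shared-index row-balanced pruning could be realized, assuming each row cluster keeps the columns with the largest summed weight magnitude and all rows in the cluster reuse that single index list. The function and parameter names (sirbs_prune, cluster_size) are illustrative and not taken from the paper.

# Minimal sketch of shared-index row-balanced pruning (assumption: a cluster
# keeps the top-k columns ranked by summed |weight| across its rows, and all
# rows in the cluster share that one column-index list).
import torch

def sirbs_prune(weight: torch.Tensor, sparsity: float, cluster_size: int = 4):
    """Prune `weight` so rows within each cluster share one column-index set.

    Returns the pruned dense weight plus, per cluster, the shared indices
    and the packed nonzero values (shape cluster_size x k each).
    """
    rows, cols = weight.shape
    assert rows % cluster_size == 0, "rows must divide evenly into clusters"
    k = max(1, int(round(cols * (1.0 - sparsity))))  # nonzeros kept per row

    pruned = torch.zeros_like(weight)
    shared_indices, packed_values = [], []
    for start in range(0, rows, cluster_size):
        block = weight[start:start + cluster_size]        # (cluster_size, cols)
        score = block.abs().sum(dim=0)                    # column importance
        idx = torch.topk(score, k).indices.sort().values  # shared column ids
        vals = block[:, idx]                              # packed nonzeros
        pruned[start:start + cluster_size, idx] = vals
        shared_indices.append(idx)
        packed_values.append(vals)
    return pruned, shared_indices, packed_values

# Example: 90% sparsity on an 8x64 matrix with clusters of 4 rows; each
# cluster stores one index list of length k instead of one list per row,
# which is where the claimed 2x-8x index-overhead reduction would come from.
w = torch.randn(8, 64)
pruned_w, indices, values = sirbs_prune(w, sparsity=0.90, cluster_size=4)
print(indices[0].tolist(), values[0].shape)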
