Abstract

Long Short-Term Memory (LSTM) networks are widely used in Automatic Speech Recognition (ASR) and Natural Language Processing (NLP). To accelerate LSTM inference, previous works have proposed various weight-pruning compression methods. Bank-Balanced Sparsity (BBS) is an efficient compression method with a balanced distribution of non-zero elements and negligible precision degradation. However, it incurs considerable additional memory overhead to store the indices of non-zero weights, which limits the compression ratio and poses a challenge for FPGAs with limited on-chip resources. This paper presents a Shared Index Bank-Balanced Sparsity (SIBBS) compression method. The rows of a weight matrix are divided into multiple bank clusters to balance the non-zero weight distribution, and the banks within a cluster share their indices, reducing the overall index storage by 2x–8x compared with BBS. In addition, a coarse-grained input-similarity skipping scheme is proposed to exploit the balanced pruning of SIBBS, achieving a 10% reduction in LSTM operations with little accuracy degradation and negligible overhead. A sparse matrix-vector multiplication architecture for SIBBS is proposed, and the customized accelerator is implemented on a Xilinx XCKU115 FPGA. Compared with state-of-the-art FPGA-based LSTM accelerators, our accelerator achieves a 1.47x–79.5x reduction in latency with little degradation in accuracy.
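The abstract does not spell out the pruning procedure, so the Python sketch below only illustrates the shared-index idea under stated assumptions: the function name `sibbs_prune`, the bank/cluster parameters, and the column-selection criterion (summed magnitude across a cluster's rows) are illustrative choices, not the paper's actual method.

```python
import numpy as np

def sibbs_prune(W, bank_size=4, keep_per_bank=2, cluster_rows=4):
    """Minimal sketch of shared-index bank-balanced pruning.

    Each row is split into banks of `bank_size` columns, and rows are
    grouped into clusters of `cluster_rows` rows. All rows of a cluster
    keep non-zeros at the same column positions inside every bank, so a
    single index set per bank serves the whole cluster (the index cost
    shrinks by roughly the cluster size versus per-row BBS indices).
    """
    rows, cols = W.shape
    assert rows % cluster_rows == 0 and cols % bank_size == 0
    pruned = np.zeros_like(W)
    shared_indices = []  # (row offset, col offset, shared column indices)

    for r0 in range(0, rows, cluster_rows):
        cluster = W[r0:r0 + cluster_rows]
        for c0 in range(0, cols, bank_size):
            bank = cluster[:, c0:c0 + bank_size]
            # Assumed criterion: rank bank columns by summed |weight|
            # over the cluster's rows and keep the top-k positions.
            score = np.abs(bank).sum(axis=0)
            keep = np.sort(np.argsort(score)[-keep_per_bank:])
            shared_indices.append((r0, c0, keep))
            pruned[r0:r0 + cluster_rows, c0 + keep] = bank[:, keep]
    return pruned, shared_indices

# Example: an 8x16 matrix with 4-row clusters stores one index set per
# bank for every 4 rows, i.e. 4x fewer indices than per-row BBS.
W = np.random.randn(8, 16).astype(np.float32)
W_pruned, indices = sibbs_prune(W, bank_size=4, keep_per_bank=2, cluster_rows=4)
```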
