Abstract

Long Short-Term Memory (LSTM) networks are widely used in Automatic Speech Recognition (ASR) and Natural Language Processing (NLP). To accelerate LSTM inference, previous works have proposed various weight-pruning compression methods. Bank-Balanced Sparsity (BBS) is an efficient compression method with a balanced distribution of non-zero elements and negligible precision degradation. However, it incurs considerable additional memory overhead to store the indices of non-zero weights, which limits the compression ratio and poses a challenge for FPGAs with limited on-chip resources. This paper presents a Shared Index Bank-Balanced Sparsity (SIBBS) compression method. The rows of a weight matrix are divided into multiple bank clusters to balance the non-zero weight distribution, and the banks within a cluster share their indices, reducing the overall index storage by 2x–8x compared with BBS. In addition, a coarse-grained input-similarity skipping scheme is proposed to exploit the balanced pruning of SIBBS, achieving a 10% reduction in LSTM operations with little accuracy degradation and negligible overhead. A sparse matrix-vector multiplication architecture for SIBBS is proposed, and the customized accelerator is implemented on a Xilinx XCKU115 FPGA. Compared with state-of-the-art FPGA-based LSTM accelerators, our accelerator achieves a 1.47x–79.5x reduction in latency with little degradation in accuracy.
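The abstract does not spell out the pruning procedure, so the Python sketch below only illustrates the shared-index idea under stated assumptions: the function name `sibbs_prune`, the bank/cluster parameters, and the column-selection criterion (summed magnitude across a cluster's rows) are illustrative choices, not the paper's actual method.

```python
import numpy as np

def sibbs_prune(W, bank_size=4, keep_per_bank=2, cluster_rows=4):
    """Minimal sketch of shared-index bank-balanced pruning.

    Each row is split into banks of `bank_size` columns, and rows are
    grouped into clusters of `cluster_rows` rows. All rows of a cluster
    keep non-zeros at the same column positions inside every bank, so a
    single index set per bank serves the whole cluster (the index cost
    shrinks by roughly the cluster size versus per-row BBS indices).
    """
    rows, cols = W.shape
    assert rows % cluster_rows == 0 and cols % bank_size == 0
    pruned = np.zeros_like(W)
    shared_indices = []  # (row offset, col offset, shared column indices)

    for r0 in range(0, rows, cluster_rows):
        cluster = W[r0:r0 + cluster_rows]
        for c0 in range(0, cols, bank_size):
            bank = cluster[:, c0:c0 + bank_size]
            # Assumed criterion: rank bank columns by summed |weight|
            # over the cluster's rows and keep the top-k positions.
            score = np.abs(bank).sum(axis=0)
            keep = np.sort(np.argsort(score)[-keep_per_bank:])
            shared_indices.append((r0, c0, keep))
            pruned[r0:r0 + cluster_rows, c0 + keep] = bank[:, keep]
    return pruned, shared_indices

# Example: an 8x16 matrix with 4-row clusters stores one index set per
# bank for every 4 rows, i.e. 4x fewer indices than per-row BBS.
W = np.random.randn(8, 16).astype(np.float32)
W_pruned, indices = sibbs_prune(W, bank_size=4, keep_per_bank=2, cluster_rows=4)
```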
