Abstract

Long Short-Term Memory (LSTM) has been widely adopted in tasks with sequence data, such as speech recognition and language modeling. LSTM brought significant accuracy improvements by introducing additional parameters to the Recurrent Neural Network (RNN). However, the increased number of parameters and computations also led to inefficiency when computing LSTM on edge devices with limited on-chip memory size and DRAM bandwidth. To reduce the latency and energy of LSTM computations, there has been a pressing need for model compression schemes and suitable hardware accelerators. In this paper, we first propose Fixed Nonzero-ratio Viterbi-based Pruning, which can reduce the memory footprint of LSTM models by 96% with negligible accuracy loss. By applying additional constraints on the distribution of surviving weights in Viterbi-based Pruning, the proposed pruning scheme mitigates the load-imbalance problem and thereby increases the processing engine utilization rate. We then propose V-LSTM, an efficient sparse LSTM accelerator based on the proposed pruning scheme. The high compression ratio of the proposed pruning scheme allows the accelerator to achieve 24.9% lower per-sample latency than that of state-of-the-art accelerators. The proposed accelerator is implemented on a Xilinx VC-709 FPGA evaluation board running at 200 MHz for evaluation.
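A minimal sketch of the fixed nonzero-ratio idea described above, under stated assumptions: the paper's actual method is Viterbi-based and is not reproduced here; this example uses simple per-row magnitude pruning only to illustrate the constraint that every row keeps the same number of surviving weights, which balances the load across processing engines. The function name `fixed_nonzero_ratio_prune` and the `keep_ratio` parameter are hypothetical.

```python
# Illustrative sketch (NOT the paper's Viterbi-based algorithm): enforce a
# fixed nonzero ratio per weight-matrix row so that every processing engine
# assigned a row performs the same number of multiply-accumulates.
import numpy as np

def fixed_nonzero_ratio_prune(W: np.ndarray, keep_ratio: float = 0.04) -> np.ndarray:
    """Keep the same number of largest-magnitude weights in every row of W."""
    rows, cols = W.shape
    k = max(1, int(round(cols * keep_ratio)))  # survivors per row (4% kept ~ 96% pruned)
    mask = np.zeros_like(W, dtype=bool)
    # Indices of the k largest-magnitude entries in each row.
    top_k = np.argpartition(np.abs(W), cols - k, axis=1)[:, cols - k:]
    np.put_along_axis(mask, top_k, True, axis=1)
    return W * mask

# Usage: every row ends up with exactly k nonzeros, avoiding load imbalance.
W = np.random.randn(8, 100).astype(np.float32)
W_pruned = fixed_nonzero_ratio_prune(W, keep_ratio=0.04)
print((W_pruned != 0).sum(axis=1))  # -> [4 4 4 4 4 4 4 4]
```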
