Abstract

Recurrent neural networks (RNNs) are a key technology for sequential applications that require efficient, real-time implementations. Despite their popularity, accelerating RNN inference efficiently is challenging due to its recurrent nature and data dependencies. This paper proposes a multi-threaded neural processing unit (NPU) for RNN/LSTM inference that increases the processing capability and quality of service of cloud-based NPUs by improving their hardware utilization. In addition, a custom coarse-grained multi-threaded LSTM (CGMT-LSTM) hardware architecture is introduced, which switches tasks among threads when LSTM computational kernels encounter a data hazard. Each thread appears as a logical NPU, and these logical NPUs share nearly all resources of the physical NPU; when one logical NPU is stalled, another can make progress. These optimizations exploit parallelism to increase hardware utilization and enhance system throughput. Evaluation results show that a dual-threaded CGMT-LSTM NPU achieves 27% higher performance with only 3.8% more area than a single-threaded one on a Stratix 10 FPGA. Compared with an implementation on the Tesla V100 GPU, our hardware architecture is 6.62 times faster and 15.88 times more power efficient, demonstrating that our approach enables high-performance, energy-efficient FPGA-based multi-LSTM inference systems.
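
To make the coarse-grained switching policy concrete, the following is a minimal behavioral sketch in Python of the idea the abstract describes: two logical NPU threads share one physical compute unit, and the scheduler switches to another thread whenever the active one stalls on the recurrent h_{t-1} dependency. This is not the paper's RTL; all names (Thread, cgmt_schedule) and cycle counts are hypothetical assumptions chosen only for illustration.

```python
# Behavioral sketch of coarse-grained multi-threading (CGMT): two logical
# NPU threads share one physical compute unit, and the scheduler switches
# threads whenever the active thread stalls on a data hazard (waiting for
# h_{t-1} of its own sequence). Hypothetical names and cycle counts.

from collections import deque

HAZARD_CYCLES = 3   # assumed stall length of the recurrent dependency
COMPUTE_CYCLES = 5  # assumed cycles per timestep's LSTM gate kernels

class Thread:
    def __init__(self, name, timesteps):
        self.name = name
        self.remaining = timesteps  # LSTM timesteps left in this sequence
        self.ready_at = 0           # cycle when the h_{t-1} hazard clears

def cgmt_schedule(threads, max_cycles=1000):
    cycle, busy = 0, deque(threads)
    trace = []
    while busy and cycle < max_cycles:
        # Pick the first thread whose hazard has cleared; idle otherwise.
        runnable = next((t for t in busy if t.ready_at <= cycle), None)
        if runnable is None:
            cycle += 1               # physical NPU idles: all threads stalled
            continue
        trace.append((cycle, runnable.name))
        cycle += COMPUTE_CYCLES      # run one timestep's kernels to completion
        runnable.remaining -= 1
        runnable.ready_at = cycle + HAZARD_CYCLES  # recurrent data hazard
        if runnable.remaining == 0:
            busy.remove(runnable)
        else:
            busy.rotate(-1)          # round-robin among the threads
    return trace, cycle

if __name__ == "__main__":
    # Two logical NPUs: while thread A waits on its own h_{t-1},
    # thread B's timestep fills the otherwise idle cycles.
    trace, total = cgmt_schedule([Thread("A", 4), Thread("B", 4)])
    print(trace, "finished at cycle", total)
```

Running the sketch shows the two sequences interleaving (A, B, A, B, ...), which is the mechanism behind the reported utilization gain: cycles that a single-threaded NPU would spend stalled on its own recurrence are filled by the other logical NPU's work.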
