Abstract

In this brief, a low-resource-utilization field-programmable gate array (FPGA)-based long short-term memory (LSTM) network architecture for accelerating the inference phase is presented. The architecture achieves low power consumption and high speed by overlapping the timing of operations and pipelining the datapath. Moreover, it requires negligible internal memory for storing intermediate data, leading to low resource utilization and simple routing, which in turn yields lower interconnect delay (and hence a higher operating frequency). A designer may readily adjust the resource utilization (as well as the latency) of the proposed architecture at the register-transfer level (RTL) by tuning the degree of parallelization. This makes mapping the architecture onto different types of FPGAs, subject to defined constraints, a simple process. The efficacy of the proposed architecture is assessed by implementing an LSTM network on different types of FPGAs. Compared with recent works, the proposed architecture provides up to about $1.6\times$, $43.6\times$, $21.9\times$, and $114.5\times$ improvements in frequency, power efficiency, GOP/s, and GOP/s/W, respectively. Finally, our proposed architecture operates at 17.64 GOP/s, which is $2.31\times$ faster than the best previously reported result.
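For context, the computation such an accelerator implements is the standard LSTM cell recurrence. The sketch below is a minimal scalar Python rendering of those textbook equations, not the paper's RTL design; the function and parameter names (`lstm_step`, the gate dictionaries `W`, `U`, `b`) are illustrative, and a real layer vectorizes these operations over the hidden dimension, which is precisely where the architecture's adjustable parallelization applies.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM inference step for a single-unit cell.

    Gates: i (input), f (forget), g (candidate), o (output).
    W, U, b are dicts keyed by gate name; x, h_prev, c_prev are
    scalars here for clarity.
    """
    i = sigmoid(W['i'] * x + U['i'] * h_prev + b['i'])
    f = sigmoid(W['f'] * x + U['f'] * h_prev + b['f'])
    g = math.tanh(W['g'] * x + U['g'] * h_prev + b['g'])
    o = sigmoid(W['o'] * x + U['o'] * h_prev + b['o'])
    c = f * c_prev + i * g          # new cell state
    h = o * math.tanh(c)            # new hidden state
    return h, c
```

The four gate matrix-vector products are mutually independent, which is what makes the datapath pipelining and operation overlapping described in the abstract effective in hardware.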
