Deep Neural Networks (DNNs) have become the standard tool for a wide range of practical applications, delivering state-of-the-art performance. Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks are a subset of DNNs built from fully connected single- or multi-layer networks. The complex neurons and internal states of LSTM networks allow them to retain a memory of past events, making them well suited to time-series applications. Despite their great potential, the heterogeneous operations and computational resource requirements of LSTM networks leave a wide gap between their processing demands and the fast response required by real-time applications on low-power, low-cost edge devices. This work proposes a novel hardware architecture that combines serial-parallel computation with matrix-algebra concepts and efficient low-power computer arithmetic for LSTM network acceleration. The hardware is based on a systolic ring of outer-product-based processing elements (PEs) and a reusable single activation function block (AFB). Both the PEs and the AFB are implemented using the coordinate rotation digital computer (CORDIC) algorithm in its linear and hyperbolic modes. Unlike most approaches, the proposed hardware can be configured to perform both recurrent and non-recurrent fully connected (FC) layer computations, making it suitable for a variety of low-power edge applications. The architecture is validated on the Xilinx PYNQ-Z1 development board using an open-source time-series dataset. The implemented design achieves an average latency of <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$114~\mu s$</tex-math></inline-formula> and a throughput of <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$1.8$</tex-math></inline-formula> GOPS.
The proposed design's low latency and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$0.438~W$</tex-math></inline-formula> power consumption make it suitable for resource-constrained edge platforms.
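The paper implements the activation function block with CORDIC in hyperbolic mode. As a rough illustration of that idea (not the authors' hardware design, which uses fixed-point shift-add datapaths), the sketch below computes tanh with the standard hyperbolic-mode CORDIC rotation recurrence, where the scale factor cancels in the sinh/cosh ratio; the function name and iteration count are illustrative choices, and `math.atanh` stands in for a precomputed angle table:

```python
import math

def cordic_tanh(z, n_iter=16):
    """Approximate tanh(z) via hyperbolic-mode CORDIC (rotation mode).

    Drives the residual angle z toward zero; x -> K*cosh(z), y -> K*sinh(z),
    so the gain K cancels in y/x. Converges for |z| < ~1.118 without
    argument reduction (iterations 4, 13, 40, ... must be repeated).
    """
    x, y = 1.0, 0.0
    i = 1        # current shift index
    k = 4        # next index that must be repeated (4, 13, 40, ...)
    for _ in range(n_iter):
        d = 1.0 if z >= 0 else -1.0          # rotation direction
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)       # elementary hyperbolic angle
        if i == k:
            k = 3 * k + 1                    # repeat this index once, then advance
        else:
            i += 1
    return y / x                              # tanh = sinh / cosh
```

The sigmoid needed by the LSTM gates can be derived from the same block via sigmoid(z) = (1 + tanh(z/2)) / 2, which is one reason a single reusable AFB suffices.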