Abstract

Accelerators for convolutional neural networks (CNNs) based on field-programmable gate arrays (FPGAs) have drawn much attention in recent years. Two INT-8 multiplications are often packed onto a single digital signal processing (DSP) block to improve DSP efficiency and performance. However, most of these works use DSPs only for the multiplications and leave the accumulations to look-up tables (LUTs), which consumes many LUTs and consequently increases energy consumption. Furthermore, these works do not detail how to obtain two signed INT-8 multiplications from one DSP. This brief therefore proposes a tiny accelerator for CNNs and a DSP-based processing element (PE) designed to implement two signed INT-8 multiply-and-accumulate (MAC) operations, reducing the LUT overhead. To let the DSP support two signed INT-8 MAC operations while avoiding the large accuracy loss caused by overflow of the partial sums (psums), we apply an efficient DSP optimization approach and a dynamic-precision quantization method. The accelerator is implemented on a Xilinx ZC706 FPGA device and achieves a DSP utilization efficiency $1.7\times$ to $7.3\times$ greater than state-of-the-art accelerators implemented on the same device. Furthermore, our accelerator offers $3.56\times$ greater energy efficiency than other FPGA-based accelerators.
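The packing idea behind such dual-MAC PEs can be illustrated in software. The sketch below is an assumption-laden emulation rather than the brief's actual design: it assumes a DSP48-style wide multiplier, places the two signed INT-8 operands 18 bits apart (the field offset and the `pack_dual_int8_mac` helper name are hypothetical), and shows why a negative low product requires a +1 borrow correction on the high product, which is precisely the signed-operand detail the abstract notes other works leave out.

```python
import random

def pack_dual_int8_mac(a, b, c):
    """Emulate two signed INT-8 multiplies sharing one wide multiplier.

    a and b are packed 18 bits apart into one operand, so a single
    multiplication by the shared operand c produces a*c in the high
    field and b*c in the low field. The 18-bit offset is an assumption
    modeled on Xilinx DSP48 port widths, not taken from the brief.
    """
    packed = (a << 18) + b                     # a in the high field, b in the low field
    p = packed * c                             # one wide multiply yields both products

    low = p & 0x3FFFF                          # unsigned view of the low 18-bit field
    borrow = (low >> 17) & 1                   # set iff b*c is negative
    b_c = low - (1 << 18) if borrow else low   # sign-extend the low field
    a_c = (p >> 18) + borrow                   # undo the borrow caused by b*c < 0
    return a_c, b_c

# Randomized check that both signed products are recovered exactly.
for _ in range(100_000):
    a, b, c = (random.randint(-128, 127) for _ in range(3))
    assert pack_dual_int8_mac(a, b, c) == (a * c, b * c)
```

The check passes for all signed operand combinations sampled: without the borrow correction, every case with a negative b*c would report a high product that is off by one.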
