Modern deep convolutional neural networks (CNNs) suffer from high computational complexity due to excessive convolution operations. Recently, fast convolution algorithms such as the fast Fourier transform (FFT) and the Winograd transform have gained attention as a way to address this problem. They reduce the number of multiplications required for convolution by replacing it with element-wise multiplication in the transform domain. However, fast convolution-based CNN accelerators have three major concerns: expensive domain transforms, large memory overhead, and limited flexibility in kernel size. In this paper, we present a novel CNN accelerator based on the number theoretic transform (NTT), which overcomes these limitations. We propose low-cost NTT and inverse-NTT converters that use only adders and shifters for on-chip domain transforms, which resolves the inflated-bandwidth problem and enables more parallel computation in the accelerator. We also propose an accelerator architecture comprising multiple tile engines with optimized data flow and mapping. Finally, we implement the proposed NTT-based CNN accelerator on a Xilinx Alveo U50 FPGA and evaluate it on popular deep CNN models. The proposed accelerator achieves throughputs of 2859.5, 990.3, and 805.6 GOPS for VGG-16, GoogLeNet, and Darknet-19, respectively, outperforming existing fast convolution-based CNN accelerators by up to $9.6\times$.
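To illustrate the underlying idea of transform-domain convolution that the abstract refers to, the following is a minimal software sketch (not the paper's hardware design): convolution is computed as an element-wise product in the NTT domain, followed by an inverse NTT. The modulus 998244353 and primitive root 3 are standard NTT-friendly choices used here for illustration only; the paper's accelerator presumably uses a modulus chosen so that the transform reduces to adds and shifts.

```python
# Minimal NTT-based convolution sketch (illustrative only; parameters are
# assumptions, not taken from the paper).

MOD = 998244353          # NTT-friendly prime: 119 * 2^23 + 1
PRIMITIVE_ROOT = 3       # primitive root modulo MOD

def ntt(a, invert=False):
    """In-place iterative radix-2 NTT over Z_MOD; len(a) must be a power of two."""
    n = len(a)
    # Bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages
    length = 2
    while length <= n:
        w_len = pow(PRIMITIVE_ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)  # modular inverse of the root
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD

def ntt_convolve(x, h):
    """Linear convolution of two integer sequences via the NTT."""
    n = 1
    while n < len(x) + len(h) - 1:
        n <<= 1
    fx = x + [0] * (n - len(x))
    fh = h + [0] * (n - len(h))
    ntt(fx)
    ntt(fh)
    prod = [a * b % MOD for a, b in zip(fx, fh)]  # element-wise multiply in transform domain
    ntt(prod, invert=True)                        # inverse-NTT back to the spatial domain
    return prod[:len(x) + len(h) - 1]

# Example: 1-D convolution of a signal with a 3-tap kernel
print(ntt_convolve([1, 2, 3, 4], [1, 1, 1]))  # -> [1, 3, 6, 9, 7, 4]
```

Because the NTT works over integers modulo a prime, the result is exact (no rounding error as in FFT-based convolution), provided the true convolution values stay below the modulus.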