Abstract

To enable efficient deployment of convolutional neural networks (CNNs) on embedded platforms for different computer vision applications, several convolution variants have been introduced, such as depthwise convolution (DWCV), transposed convolution (TPCV), and dilated convolution (DLCV). To address the utilization degradation that occurs when a general convolution engine executes these emerging operators, a highly flexible and reconfigurable hardware accelerator is proposed to efficiently support various CNN-based vision tasks. First, to avoid the workload imbalance of TPCV, a zero transfer and skipping (ZTS) method is proposed to reorganize the computation process; to eliminate the redundant zero calculations of TPCV and DLCV, a sparsity-alike processing (SAP) method is proposed based on a weight-oriented dataflow. Second, DWCV and pooling layers are configured to execute directly after standard convolutions without external memory accesses. Furthermore, a programmable execution schedule is introduced for better flexibility. Finally, the proposed accelerator is evaluated on an Intel Arria 10 SoC FPGA. Experimental results show state-of-the-art performance on both large-scale and lightweight CNNs for image segmentation and classification. Specifically, the accelerator achieves a processing speed of up to 339.9 FPS and a computational efficiency of up to 0.58 GOPS/DSP, which is <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$3.3\times $ </tex-math></inline-formula> better than the prior art evaluated on the same network.
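The redundant zero calculations that ZTS and SAP target can be seen in how transposed convolution is conventionally computed: zeros are inserted between input samples before an ordinary convolution, so a large fraction of multiply-accumulates have a zero operand. The minimal 1-D sketch below (not the paper's implementation; the function `transposed_conv1d` and its zero-counting are illustrative assumptions) makes this waste explicit.

```python
def transposed_conv1d(x, w, stride):
    """Naive 1-D transposed convolution via zero-insertion, counting
    multiplications whose input operand is zero (the redundant work
    that a zero-skipping scheme would avoid). Illustrative only."""
    # Insert (stride - 1) zeros between consecutive input samples.
    up = []
    for v in x:
        up.append(v)
        up.extend([0.0] * (stride - 1))
    up = up[:len(up) - (stride - 1)]  # drop trailing inserted zeros
    pad = len(w) - 1
    up = [0.0] * pad + up + [0.0] * pad  # full zero padding

    out, zero_macs, total_macs = [], 0, 0
    for i in range(len(up) - len(w) + 1):
        acc = 0.0
        for j, wj in enumerate(w):
            total_macs += 1
            if up[i + j] == 0.0:
                zero_macs += 1  # wasted multiply on an inserted zero
            acc += up[i + j] * wj
        out.append(acc)
    return out, zero_macs, total_macs

# With stride 2, over half the multiplies hit a zero operand:
y, zeros, total = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], stride=2)
```

Here 12 of the 21 multiplies involve a zero operand, which is the computation a zero-skipping engine can eliminate outright.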
