Abstract

Owing to the large increase in on-chip block memory in the latest field-programmable gate arrays (FPGAs), the feature maps and weights can now be stored entirely on a single FPGA chip, so highly efficient utilization of the on-chip DSP slices has become the bottleneck for FPGA-based convolutional neural network (CNN) hardware accelerators. In this paper, by adopting an efficient dataflow scheduling mode named row pass and packing two weights together, two 8-bit multiplications sharing the same activation can be performed in one DSP slice of a Xilinx FPGA, compared with only one 8-bit multiplication per DSP slice in traditional designs. Based on the proposed architecture, a CNN accelerator realizing the convolution and pooling layers of AlexNet on the Xilinx VCU118 FPGA platform achieves 2.8 TOPS with only 2148 DSPs at 300 MHz, outperforming previous designs in performance density.
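The weight-packing idea can be sketched in software. This is a minimal sketch modeled on the widely used Xilinx DSP48E2 INT8 packing scheme; the 18-bit guard-band shift and the signed-weight/unsigned-activation convention are our assumptions for illustration, not details taken from the paper:

```python
# Sketch: pack two 8-bit weight multiplications that share one
# activation into a single wide multiply, as a DSP48E2-style slice
# would compute it. Assumptions (ours, not from the paper): signed
# 8-bit weights, unsigned 8-bit activation, 18-bit shift so the two
# partial products cannot interfere.

SHIFT = 18  # guard band: w2*a fits in fewer than 18 signed bits

def packed_mul(a: int, w1: int, w2: int) -> tuple[int, int]:
    """Return (w1*a, w2*a) computed with one wide multiplication."""
    assert 0 <= a <= 255 and -128 <= w1 <= 127 and -128 <= w2 <= 127
    packed = (w1 << SHIFT) + w2          # two weights in one operand
    p = packed * a                       # the single DSP multiply
    low = p & ((1 << SHIFT) - 1)         # lower partial-product field
    if low >= 1 << (SHIFT - 1):          # sign-extend the 18-bit field
        low -= 1 << SHIFT
    high = (p - low) >> SHIFT            # strip low term, recover w1*a
    return high, low

# Example: one multiply yields both products.
print(packed_mul(200, -5, 100))  # -> (-1000, 20000)
```

Because `w2*a` always fits within the 18-bit guard band, the two partial products never overlap, which is what lets one hardware multiplier produce both results per cycle.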
