Abstract

Owing to the large increase in on-chip block memory in the latest field-programmable gate arrays (FPGAs), feature maps and weights can now be stored within a single FPGA chip, so highly efficient utilization of the on-chip DSP slices has become the bottleneck for FPGA-based convolutional neural network (CNN) hardware accelerators. In this paper, by adopting an efficient data-flow scheduling mode named a row pass and by packing two weights together, two 8-bit multiplications sharing the same activation can be performed in one DSP slice of a Xilinx FPGA, compared with only one 8-bit multiplication per DSP slice in the traditional approach. Finally, based on the proposed architecture, a CNN accelerator realizing the convolution and pooling layers of AlexNet on the Xilinx VCU118 FPGA platform achieves 2.8 TOPS using only 2148 DSPs at 300 MHz, which outperforms previous designs in performance density.
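The weight-packing idea in the abstract can be illustrated in software: two weights are concatenated into one wide operand so that a single wide multiplication yields both 8-bit products with the shared activation. The sketch below is a simplified model using unsigned 8-bit operands and a 16-bit shift; the real DSP48-based design described in the paper handles signed operands, which requires a wider guard shift and correction logic. All function names here are hypothetical.

```python
def packed_mul(w1: int, w2: int, a: int) -> tuple[int, int]:
    """Compute w1*a and w2*a with a single wide multiplication.

    Simplified unsigned model of packing two 8-bit weights into one
    multiplier operand. Since w2 * a <= 255 * 255 < 2**16, the low
    product never overflows into the high half, so a 16-bit shift
    cleanly separates the two results.
    """
    packed = (w1 << 16) | w2      # one wide operand holding both weights
    product = packed * a          # single multiplication covers both products
    p2 = product & 0xFFFF         # low 16 bits  -> w2 * a
    p1 = product >> 16            # high bits    -> w1 * a
    return p1, p2
```

For example, `packed_mul(3, 5, 7)` returns `(21, 35)`, i.e. both 3*7 and 5*7 from one multiply, which is the mechanism that lets one DSP slice serve two weight streams sharing an activation.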
