A Framework for Generating High Throughput CNN Implementations on FPGAs

Hanqing Zeng, Ren Chen, Chi Zhang, Viktor Prasanna

doi:10.1145/3174243.3174265

Abstract

We propose a framework to generate highly efficient accelerators for CNN inference on FPGAs. The framework consists of algorithmic optimizations that reduce computation complexity and communication volume, a mapping methodology for efficient resource utilization, and a tool for automatic Verilog generation. The algorithmic optimizations improve the throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. Although the Overlap-and-Add (OaA) technique is well known, it wastes computation at the image edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves on OaA significantly by reducing the computation wasted on padded pixels. Used in conjunction with OaA, CaP lets us choose a fixed FFT size at design time and still achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique that further boosts throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device through fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth in a device coefficient, and we identify the optimal architectural parameters based on the tradeoff between computation and communication cost. The framework includes a tool that automatically generates fully synthesizable Verilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on the Intel HARP heterogeneous platform. Using our framework, we achieve throughput of 780.6 GOPS, 669.1 GOPS and 552.1 GOPS for AlexNet, VGG16 and FCN-16s respectively. These correspond to 6.8× (AlexNet) and 4.9× (VGG16) improvement over state-of-the-art implementations.
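As a rough illustration of the Overlap-and-Add idea mentioned in the abstract, the sketch below convolves a 1D signal with a kernel using a fixed FFT size. It is the textbook OaA scheme in one dimension only, written in pure Python for readability; it is not the authors' implementation and does not include CaP or the frequency domain loop tiling. The function names (`fft`, `ifft`, `oaa_convolve`, `direct_convolve`) are hypothetical and chosen for this example.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT: conjugate-twiddle transform scaled by 1/n."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def oaa_convolve(signal, kernel, fft_size):
    """Full linear convolution via Overlap-and-Add with a fixed FFT size."""
    k = len(kernel)
    if fft_size & (fft_size - 1):
        raise ValueError("fft_size must be a power of two")
    # Each tile holds this many input samples so that tile + k - 1 == fft_size,
    # which keeps the circular convolution equal to the linear one.
    tile = fft_size - k + 1
    if tile < 1:
        raise ValueError("fft_size must be at least len(kernel)")
    # The kernel spectrum is computed once and reused for every tile.
    kernel_spec = fft(list(kernel) + [0.0] * (fft_size - k))
    out = [0.0] * (len(signal) + k - 1)
    for start in range(0, len(signal), tile):
        chunk = list(signal[start:start + tile])
        # Zero padding up to fft_size; on a short final tile most of this
        # padding is the edge waste the abstract refers to.
        chunk += [0.0] * (fft_size - len(chunk))
        product = [a * b for a, b in zip(fft(chunk), kernel_spec)]
        for i, v in enumerate(ifft(product)):
            if start + i < len(out):
                out[start + i] += v.real  # overlapping tails are summed
    return out


def direct_convolve(signal, kernel):
    """Reference spatial-domain convolution used to check the result."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += s * w
    return out
```

With `fft_size` fixed at 4 and a 3-tap kernel, each tile carries only 2 input samples, and a signal of odd length leaves a final tile that is mostly padding. That kind of waste on small or awkwardly sized inputs is what the abstract says CaP is designed to reduce.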

