A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs

Zhiqiang Liu, Jingfei Jiang, Paul Chow, Jinwei Xu, Yong Dou, Jie Zhou

doi:10.3390/electronics8010065

Open Access

https://doi.org/10.3390/electronics8010065

Abstract

Three-dimensional convolutional neural networks (3D CNNs) have gained popularity in many complicated computer vision applications. Many customized FPGA-based accelerators have been proposed for 2D CNNs, but very few for 3D CNNs. 3D CNNs are far more computationally intensive, and the extra dimension further expands the design space, which makes accelerating 3D CNNs on FPGAs a big challenge. Motivated by the finding that the computation patterns of 2D and 3D CNNs are very similar, we propose a uniform architecture design for accelerating both 2D and 3D CNNs. The uniform architecture is based on the idea of mapping convolutions to matrix multiplications. A customized mapping module generates the feature matrix tilings without storing the entire enlarged feature matrix on-chip or off-chip; a splitting strategy reconstructs a convolutional layer so that it fits the on-chip memory capacity; and a 2D multiply-and-accumulate (MAC) array computes the matrix multiplications efficiently. For demonstration, we implement an accelerator prototype with a high-level synthesis (HLS) methodology on a Xilinx VC709 board and test it on three typical CNN models: AlexNet, VGG16, and C3D. Experimental results show that the accelerator achieves state-of-the-art throughput performance on both 2D and 3D CNNs, with much better energy efficiency than the CPU and GPU.
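
The idea of mapping a convolution to a matrix multiplication can be illustrated with a minimal software sketch in plain Python. This is not the authors' hardware design: the function names (`conv3d_direct`, `feature_columns`, `conv3d_as_matmul`), the nested-list data layout, and the stride-1, no-padding setting are all assumptions made for illustration. The generator yields one column of the enlarged feature matrix at a time, loosely mirroring the point that the whole enlarged matrix never has to be stored, and a 2D convolution is treated as the special case with depth 1.

```python
from itertools import product


def _shapes(feat, kern):
    """Return (C, D, H, W) of the feature maps and (M, Kd, Kh, Kw) of the kernels."""
    C, D, H, W = len(feat), len(feat[0]), len(feat[0][0]), len(feat[0][0][0])
    M, Kd = len(kern), len(kern[0][0])
    Kh, Kw = len(kern[0][0][0]), len(kern[0][0][0][0])
    return (C, D, H, W), (M, Kd, Kh, Kw)


def conv3d_direct(feat, kern):
    """Reference 3D convolution, stride 1, no padding (assumed setting).

    feat is indexed [C][D][H][W]; kern is indexed [M][C][Kd][Kh][Kw].
    """
    (C, D, H, W), (M, Kd, Kh, Kw) = _shapes(feat, kern)
    Do, Ho, Wo = D - Kd + 1, H - Kh + 1, W - Kw + 1
    out = [[[[0] * Wo for _ in range(Ho)] for _ in range(Do)] for _ in range(M)]
    for m, d, h, w in product(range(M), range(Do), range(Ho), range(Wo)):
        out[m][d][h][w] = sum(
            feat[c][d + kd][h + kh][w + kw] * kern[m][c][kd][kh][kw]
            for c, kd, kh, kw in product(range(C), range(Kd), range(Kh), range(Kw))
        )
    return out


def feature_columns(feat, Kd, Kh, Kw):
    """Yield one column of the enlarged feature matrix per output position.

    Columns are produced on the fly, so the full matrix is never materialized.
    """
    C, D, H, W = len(feat), len(feat[0]), len(feat[0][0]), len(feat[0][0][0])
    for d, h, w in product(range(D - Kd + 1), range(H - Kh + 1), range(W - Kw + 1)):
        yield [
            feat[c][d + kd][h + kh][w + kw]
            for c, kd, kh, kw in product(range(C), range(Kd), range(Kh), range(Kw))
        ]


def conv3d_as_matmul(feat, kern):
    """Compute the same convolution as a weight-matrix times feature-matrix product."""
    (C, D, H, W), (M, Kd, Kh, Kw) = _shapes(feat, kern)
    Do, Ho, Wo = D - Kd + 1, H - Kh + 1, W - Kw + 1
    # Weight matrix: one row per output channel, flattened in (c, kd, kh, kw) order,
    # which matches the ordering used inside feature_columns.
    weights = [
        [kern[m][c][kd][kh][kw]
         for c, kd, kh, kw in product(range(C), range(Kd), range(Kh), range(Kw))]
        for m in range(M)
    ]
    out = [[[[0] * Wo for _ in range(Ho)] for _ in range(Do)] for _ in range(M)]
    for idx, col in enumerate(feature_columns(feat, Kd, Kh, Kw)):
        # Recover the output position from the column index (w varies fastest).
        d, rem = divmod(idx, Ho * Wo)
        h, w = divmod(rem, Wo)
        for m in range(M):
            out[m][d][h][w] = sum(a * b for a, b in zip(weights[m], col))
    return out
```

Calling `conv3d_as_matmul` on feature maps with depth 1 and kernels with depth 1 gives an ordinary 2D convolution, which is one way to see why a single matrix-multiplication datapath could serve both cases. Each inner `sum` in `conv3d_as_matmul` is one multiply-and-accumulate chain, the kind of work a MAC array would perform in parallel.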

Highlights

In recent years, convolutional neural networks (CNNs) have gained great success in various computer vision applications such as image classification [1], object detection [2], and face recognition [3].
We propose a uniform architecture design for accelerating both 2D and 3D CNNs based on the idea of mapping convolutions to matrix multiplication operations. Special effort is devoted to optimizing memory and computation to enhance throughput performance.
We analytically model the resource utilization and throughput performance of our architecture, which helps to configure an accelerator on a specific platform within certain constraints, including hardware performance, memory bandwidth, and clock frequency.
We demonstrate the architecture design by implementing an accelerator on the Xilinx VC709 board with the high-level synthesis (HLS) methodology.

Summary

Introduction

Convolutional neural networks (CNNs) have gained great success in various computer vision applications such as image classification [1], object detection [2], and face recognition [3]. CNNs have primarily been applied to 2D images to automatically extract spatial features, and they have significantly improved image classification accuracy. To effectively incorporate motion information in video analysis, 3D CNNs with spatiotemporal convolutional kernels have been proposed. Owing to their ability to capture both spatial and temporal features, 3D CNNs have proven very effective in many video-based applications, including object recognition [4], hand gesture recognition [5], and human action recognition [6]. VGG16 [7], a real-life 2D CNN model for image classification with

Methods

Results

Conclusion