Abstract

In this paper, an FPGA-based convolutional neural network coprocessor is proposed. The coprocessor comprises a 1D convolution computation unit (PE) operating in row-stationary (RS) streaming mode and a 3D convolution computation unit (PE chain) organized as a systolic array. The coprocessor can flexibly control the number of enabled PE arrays according to the number of output channels of the convolutional layer. We design a storage system with a multilevel cache, in which the global cache distributes data to the local caches by multicast, and we propose an image segmentation method compatible with the hardware architecture. The proposed coprocessor implements the convolutional and pooling layers of the VGG16 neural network model, with activation, weight, and bias values quantized using 16-bit fixed-point quantization; it achieves a peak computational performance of 316.0 GOP/s and an average computational performance of 62.54 GOP/s at a clock frequency of 200 MHz, with a power consumption of about 9.25 W.
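
As a rough illustration of the 16-bit fixed-point quantization mentioned above, the C sketch below converts floating-point values to signed 16-bit codes with saturation. The 8.8 integer/fraction split (FRAC_BITS = 8) is an assumption for illustration only; the abstract does not specify the paper's actual Q-format.

```c
#include <stdint.h>
#include <math.h>

/* Hypothetical Q-format split: the paper only states 16-bit fixed
 * point; an 8.8 split is assumed here for illustration. */
#define FRAC_BITS 8

/* Quantize a float to signed 16-bit fixed point with saturation. */
static int16_t quantize_q(float x) {
    float scaled = roundf(x * (float)(1 << FRAC_BITS));
    if (scaled >  32767.0f) scaled =  32767.0f;   /* saturate high */
    if (scaled < -32768.0f) scaled = -32768.0f;   /* saturate low  */
    return (int16_t)scaled;
}

/* Recover an approximate float from the fixed-point code. */
static float dequantize_q(int16_t q) {
    return (float)q / (float)(1 << FRAC_BITS);
}
```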

Highlights

  • Hardware acceleration of artificial neural networks (ANNs) has been a hot research topic since the 1990s [1, 2]

  • We provide a coprocessor implementation for convolutional neural networks, aimed at accelerating the convolutional and pooling layers of convolutional neural networks on FPGAs for use in heterogeneous acceleration systems or embedded terminals

  • The PE array designed in this paper contains 4 PE chains, each corresponding to one channel of the input feature map; when the input feature map has only three channels, the fourth local image buffer (LIB) is filled with zeros for convenience of control (a sketch of this convention follows this list)
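
A minimal sketch of the zero-padding convention described in the last highlight, under assumed names and tile sizes (lib, TILE, and load_tiles are illustrative, not the paper's RTL): each PE chain reads from its own local image buffer, and a buffer left at zero contributes nothing to the accumulated partial sum, so a 3-channel input needs no special-case control.

```c
#include <string.h>

#define NUM_CHAINS 4   /* PE chains in the array (one per input channel) */
#define TILE       64  /* illustrative tile size, not from the paper */

/* Local image buffers, one per PE chain; unused channels stay zero. */
static short lib[NUM_CHAINS][TILE];

/* Load `channels` input-feature-map tiles and zero-fill the rest,
 * mirroring the "fourth LIB is all 0" control simplification. */
void load_tiles(const short *ifmap, int channels) {
    for (int c = 0; c < NUM_CHAINS; ++c) {
        if (c < channels)
            memcpy(lib[c], ifmap + c * TILE, sizeof lib[c]);
        else
            memset(lib[c], 0, sizeof lib[c]);  /* padded channel */
    }
}

/* Each chain contributes a partial sum; a zeroed chain adds nothing. */
int accumulate(int idx, const short w[NUM_CHAINS]) {
    int psum = 0;
    for (int c = 0; c < NUM_CHAINS; ++c)
        psum += lib[c][idx] * w[c];
    return psum;
}
```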

Summary

Introduction

Hardware acceleration of artificial neural networks (ANNs) has been a hot research topic since the 1990s [1, 2]. Convolutional neural networks were proposed as early as 1989 but did not become a research hotspot until 2006, largely because of the limited hardware computing power available at the time. In general-purpose computing platforms, all Arithmetic Logic Units (ALUs) share controllers and memory, and the convolutional and fully connected layers are mapped into matrix multiplications to participate in the computation. FPGAs are highly programmable and configurable, offer high energy efficiency and short development cycles, and tools such as High-Level Synthesis and OpenCL further accelerate FPGA development. Sankaradas et al. designed an FPGA-based coprocessor for CNNs [11] with low-precision data bit-widths (20-bit fixed-point quantization for weights and 16-bit fixed-point quantization for feature-map values), but it supports only a fixed convolutional kernel size and incurs frequent off-chip memory accesses.
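
A minimal sketch of this lowering of convolution to matrix multiplication, assuming a single input channel and unit stride: each KxK input patch is unrolled into one column (the common im2col transform), after which the convolution becomes a plain matrix multiply. The function names and data layouts here are illustrative, not taken from the paper or any specific platform.

```c
/* Lower a KxK convolution over an HxW single-channel image to a
 * matrix multiply: each output pixel becomes one column of K*K
 * unrolled input values (the classic im2col transform). */
void im2col(const float *img, int H, int W, int K,
            float *cols /* [K*K][(H-K+1)*(W-K+1)], row-major */) {
    int OH = H - K + 1, OW = W - K + 1;
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    cols[(ky * K + kx) * (OH * OW) + oy * OW + ox] =
                        img[(oy + ky) * W + (ox + kx)];
}

/* The convolution is then out[1 x N] = w[1 x KK] * cols[KK x N],
 * where N = OH*OW and KK = K*K. */
void conv_as_gemm(const float *w, const float *cols,
                  int KK, int N, float *out) {
    for (int n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < KK; ++k)
            acc += w[k] * cols[k * N + n];
        out[n] = acc;
    }
}
```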

Coprocessor Architecture
Design of Each Major Module in the Coprocessor
FPGA Hardware Verification
Findings
Conclusion