Abstract

The You Only Look Once (YOLO) neural network has great advantages and extensive applications in computer vision. The convolutional layers are the most important part of the neural network and take up most of the computation time, so improving the efficiency of the convolution operations can greatly increase the speed of the network. Field programmable gate arrays (FPGAs) have been widely used in accelerators for convolutional neural networks (CNNs) thanks to their configurability and parallel computing capability. This paper proposes a design space exploration for the YOLO neural network based on FPGA. A data block transmission strategy is proposed, and a multiply-and-accumulate (MAC) unit consisting of two 14 × 14 processing element (PE) matrices is designed. The PE matrices are configurable for different CNNs according to the required functions. To take full advantage of the limited logic resources and memory bandwidth on the given FPGA device while achieving the best performance, an improved roofline model is used to evaluate the hardware design and balance the computing throughput against the memory bandwidth requirement. The accelerator achieves 41.99 giga operations per second (GOPS) and consumes 7.50 W running at 100 MHz on the Xilinx ZC706 board.
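The basic roofline relationship referred to above can be sketched as follows. This is a minimal illustration of the standard roofline bound, not the paper's improved model; the peak-throughput, bandwidth, and ratio values are hypothetical placeholders.

```python
def attainable_gops(peak_gops: float, bandwidth_gbs: float, ctc_ratio: float) -> float:
    """Roofline bound on throughput.

    peak_gops:     compute roof of the device (GOPS) -- illustrative value
    bandwidth_gbs: off-chip memory bandwidth (GB/s) -- illustrative value
    ctc_ratio:     computation-to-communication ratio (ops per byte of
                   off-chip traffic) of a candidate design point
    The attainable throughput is capped either by the compute roof or by
    how many operations the memory system can feed per second.
    """
    return min(peak_gops, bandwidth_gbs * ctc_ratio)


# A memory-bound design point: 2 ops/byte * 10 GB/s = 20 GOPS < 100 GOPS roof
print(attainable_gops(100.0, 10.0, 2.0))   # -> 20.0
# A compute-bound design point: 50 ops/byte saturates the 100 GOPS roof
print(attainable_gops(100.0, 10.0, 50.0))  # -> 100.0
```

Design space exploration then amounts to choosing the tiling and unrolling factors that maximize this bound subject to the on-chip resource limits.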

Highlights

  • Convolutional neural networks (CNNs) [1] are widely applied in a great variety of fields, such as object recognition [2,3,4,5], speech recognition [6,7], facial recognition [8,9], image recognition [10,11,12,13,14], and so on

  • As shown in Equation (3), the intermediate data in partial-sum registers (PRs) are transferred back to the multiply-and-accumulate (MAC) unit to be added to the results of the later block, which reduces off-chip memory access and improves data transmission efficiency

  • We propose an accelerator for You Only Look Once (YOLO) v2-tiny and implement it on the Xilinx ZC706 board with 16-/32-bit fixed points
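The partial-sum reuse described in the highlights can be sketched in software: the convolution is processed one input-channel block at a time, and each block's MAC results are accumulated onto the partial sums held from earlier blocks, so intermediate results never leave the accumulator. This is an illustrative functional model, assuming simple shapes and a hypothetical tile size, not the paper's hardware design.

```python
import numpy as np

def tiled_conv(ifm, weights, tile_c=4):
    """Convolution computed over input-channel blocks.

    psum plays the role of the partial-sum registers (PRs): it keeps the
    intermediate sums on "chip" between channel blocks, so only the final
    output would need to be written back to off-chip memory.
    ifm:     (C, H, W) input feature map
    weights: (M, C, K, K) kernels
    """
    C, H, W = ifm.shape
    M, _, K, _ = weights.shape
    out_h, out_w = H - K + 1, W - K + 1
    psum = np.zeros((M, out_h, out_w))        # partial-sum registers
    for c0 in range(0, C, tile_c):            # one input-channel block at a time
        blk = ifm[c0:c0 + tile_c]
        wblk = weights[:, c0:c0 + tile_c]
        for m in range(M):
            for y in range(out_h):
                for x in range(out_w):
                    # accumulate this block's MAC results onto the
                    # partial sums produced by earlier blocks
                    psum[m, y, x] += np.sum(blk[:, y:y + K, x:x + K] * wblk[m])
    return psum
```

Because the result is independent of `tile_c`, the block size can be chosen purely to fit the on-chip buffers, which is exactly the degree of freedom the data block transmission strategy exploits.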


Introduction

Convolutional neural networks (CNNs) [1] are widely applied in a great variety of fields, such as object recognition [2,3,4,5], speech recognition [6,7], facial recognition [8,9], image recognition [10,11,12,13,14], and so on. Accelerating CNNs, particularly on small embedded devices, has become a popular research topic. Because CNNs involve a large number of computations and the central processing unit (CPU) executes operations serially, the CPU cannot fully exploit the parallelism of CNNs. Recently, several CNN accelerators have been implemented on the graphics processing unit (GPU) [15]. Although the GPU offers efficient parallelism and a high-density computing capability [16,17], it is limited by a lower energy-efficiency gain and cannot adjust its hardware resources to suit different applications; it is also too large to be deployed on small embedded devices. Field programmable gate arrays (FPGAs), which offer highly parallel computing and can be programmed for specific applications, are therefore currently widely applied in the field of hardware acceleration.
