Abstract

Object detection algorithms are an important technology in computer vision and are usually deployed on devices with high computational power, such as CPUs or GPUs. However, traditional CPUs cannot sustain real-time processing, while GPUs consume relatively large amounts of power, which limits their application scenarios. FPGAs, with their high throughput, high power efficiency, and reconfigurability, are an ideal platform for research on hardware acceleration of object detection algorithms; how to implement efficient object detection algorithms on embedded devices is therefore an important challenge. This paper proposes and demonstrates an FPGA-based accelerator for the YOLOv4-Tiny object detection algorithm. The design adopts an optimization strategy that includes an IP core for convolutional computation designed in HLS, fusion of batch normalization layers with convolutional layers, dynamic fixed-point 16-bit quantization, loop unrolling, double-buffered storage, and channel-parallel acceleration, which together improve the computational performance and efficiency of the detection algorithm. The YOLOv4-Tiny algorithm was deployed on an FPGA development board, and experimental results show a computational performance of 18.32 GOP/s and an energy efficiency of 6.66 GOP/s/W, an improvement of 1.81 times in computational performance and 1.86 times in energy efficiency over a comparable implementation on the same platform. Compared with an unquantized CPU implementation, the computational performance is 0.51 times that of the CPU while the energy efficiency is 34 times higher, and the latency for accelerating a single image is reduced to 383 ms, enabling better detection performance in object detection and tracking applications.
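As a rough illustration of the kind of convolution IP core the abstract describes, the following HLS C++ sketch shows a tile-level compute engine that pipelines the spatial loops and fully unrolls the channel loops for channel-parallel acceleration, operating on 16-bit fixed-point data. All tile sizes, buffer shapes, function names, and the particular integer/fraction split are assumptions made for illustration, not the paper's actual IP-core interface; the input and weight tiles are assumed to be staged into on-chip buffers by a separate load stage so that double buffering can overlap data transfer with computation.

// conv_tile.cpp -- illustrative sketch only; tile sizes, names, and the
// fixed-point format are assumptions, not the paper's actual interface.
#include <ap_fixed.h>

// Hypothetical 16-bit fixed-point type (8 integer bits assumed here; a
// dynamic fixed-point scheme would select the fraction width per layer).
typedef ap_fixed<16, 8> data_t;

#define TILE_OC 16   // output channels processed in parallel (assumed)
#define TILE_IC 16   // input channels processed in parallel (assumed)
#define TILE_H  8    // output-tile height (assumed)
#define TILE_W  8    // output-tile width (assumed)
#define K       3    // 3x3 convolution kernel

// Computes one output tile. out_buf is assumed to be pre-initialized by the
// caller, e.g. with the bias folded in from batch normalization.
void conv_tile(data_t in_buf[TILE_IC][TILE_H + K - 1][TILE_W + K - 1],
               data_t w_buf[TILE_OC][TILE_IC][K][K],
               data_t out_buf[TILE_OC][TILE_H][TILE_W]) {
// Partition buffers across the channel dimensions so the unrolled
// multiply-accumulate lanes can all be fed in the same cycle.
#pragma HLS ARRAY_PARTITION variable=in_buf complete dim=1
#pragma HLS ARRAY_PARTITION variable=w_buf complete dim=1
#pragma HLS ARRAY_PARTITION variable=w_buf complete dim=2
#pragma HLS ARRAY_PARTITION variable=out_buf complete dim=1

    for (int kh = 0; kh < K; kh++) {
        for (int kw = 0; kw < K; kw++) {
            for (int h = 0; h < TILE_H; h++) {
                for (int w = 0; w < TILE_W; w++) {
#pragma HLS PIPELINE II=1
                    // Channel loops fully unrolled: TILE_OC x TILE_IC
                    // multiply-accumulates issue every clock cycle.
                    for (int oc = 0; oc < TILE_OC; oc++) {
#pragma HLS UNROLL
                        for (int ic = 0; ic < TILE_IC; ic++) {
#pragma HLS UNROLL
                            out_buf[oc][h][w] +=
                                in_buf[ic][h + kh][w + kw] * w_buf[oc][ic][kh][kw];
                        }
                    }
                }
            }
        }
    }
}

The batch-normalization fusion mentioned in the abstract is typically performed off-line before quantization, by folding the BN scale and shift into the convolution weights and bias; the accelerator then sees only plain convolutions and needs no dedicated normalization hardware at inference time.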
