Abstract

The design of convolutional neural network (CNN) hardware accelerators based on a single computing engine (CE) or multi-CE architecture has received widespread attention in recent years. Although such accelerators offer advantages in deployment flexibility and development cycle, they remain limited in resource utilization and data throughput. When processing large feature maps, their speed typically reaches only about 10 frames/s, which does not meet the requirements of application scenarios such as autonomous driving and radar detection. To address these problems, this article proposes a fully pipelined hardware accelerator design based on a pixel-by-pixel strategy. Under this strategy, the concept of the layer is de-emphasized, and the generation of each pixel of the output feature map (Ofmap) can be optimized. To pipeline the entire computing system, we expand every layer of the neural network into hardware, eliminating the buffers between layers and maximizing full connectivity across the whole network; this approach yields excellent performance. Moreover, because the pixel data stream is a fundamental paradigm in image processing, our fully pipelined accelerator generalizes to various CNNs (MobileNetV1, MobileNetV2, and FashionNet) in computer vision. As an example, the accelerator for MobileNetV1 achieves a speed of 4205.50 frames/s and a throughput of 4787.15 GOP/s at 211 MHz, with an output latency of 0.60 ms per image. This extremely short processing time opens the door to AI applications in high-speed scenarios.
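The contrast the abstract draws between buffered layer-by-layer execution and pixel-by-pixel pipelining can be sketched in software. The following Python sketch is purely illustrative (the paper's design is hardware, not code): `layer_by_layer` buffers a full feature map between stages, while `fully_pipelined` chains the stages so each pixel flows through every layer as it arrives, with no inter-layer buffer. The stage functions are hypothetical stand-ins for convolutional layers.

```python
def layer_by_layer(pixels, stages):
    """Baseline: each layer finishes and buffers its entire output
    feature map before the next layer starts (inter-layer buffers)."""
    fmap = list(pixels)
    for stage in stages:
        fmap = [stage(p) for p in fmap]   # full Ofmap buffered here
    return fmap

def fully_pipelined(pixels, stages):
    """Pixel-by-pixel: chain streaming stages so each pixel traverses
    all layers as soon as it arrives; no inter-layer buffers."""
    stream = iter(pixels)
    for stage in stages:
        # map() binds this stage immediately and pulls pixels lazily,
        # so all stages are "in flight" at once, like a hardware pipeline
        stream = map(stage, stream)
    yield from stream

# Two toy stages standing in for network layers.
stages = [lambda p: p * 2, lambda p: p + 1]
print(list(fully_pipelined(range(4), stages)))  # same result, no buffering
```

Both functions compute the same output; the difference, as in the hardware design, is that the streaming version never materializes an intermediate feature map between layers.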
