Convolutional Neural Networks (CNNs) have demonstrated high accuracy in applications such as object detection, classification, and image processing. However, convolutional layers account for the majority of the computation within CNNs. Typically, these layers are executed on GPUs, resulting in high power consumption and hindering lightweight deployment. This paper presents a design that deploys convolutional layers on FPGAs with adjustable parameters. In this FPGA deployment, a 4 × 4 3D sliding window is used to traverse the data, reducing bandwidth requirements and facilitating seamless integration with subsequent processing stages. A three-dimensional plane-buffer design is proposed to enable data reuse. Compared to feeding the feature map directly into the computation, it reduces the on-chip memory bandwidth requirement by 75%. Additionally, a new addressing strategy is introduced to map 3D feature maps to RAM addresses, eliminating addressing time. Because high-level synthesis (HLS) is resource-intensive, the convolutional layers are implemented in HDL instead. The design achieves an inference throughput of 121.36 GOPS at 16-bit precision, a 39.10-fold speedup over CPU implementations.
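The 75% figure is consistent with row reuse in a 4 × 4 sliding window: as the window advances by one row, three of its four rows are already buffered on-chip, so only one new row must be fetched. The sketch below illustrates this reuse ratio and one plausible 3D-to-linear address mapping; the channel-major layout and function names are illustrative assumptions, not the paper's exact scheme.

```python
def flat_addr(c, r, col, H, W):
    """Map a 3D feature-map coordinate (channel, row, column) to a
    linear RAM address. Channel-major layout is an assumption here;
    the paper's actual mapping may differ."""
    return (c * H + r) * W + col


def window_row_reuse(k):
    """Fraction of a k x k window's rows that are reused (already
    buffered) when the window slides down by one row."""
    return (k - 1) / k


# A 4 x 4 window reuses 3 of its 4 rows per step: a 75% reduction
# in rows fetched from memory, matching the bandwidth claim above.
print(window_row_reuse(4))  # -> 0.75
```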