Abstract

The convolution operation consists of three-dimensional multiply-accumulate (MAC) operations nested within four levels of loops, which creates a large design space to optimize. Prior work, however, did not thoroughly investigate these loop optimization techniques, leading to accelerators built on inefficient parallel computing architectures that consume unnecessary resources. This study addresses the limitations of existing FPGA-based Convolutional Neural Network (CNN) accelerators in computational efficiency and flexibility by proposing a novel scalable accelerator architecture. We first define a design space spanning loop optimization techniques such as loop tiling, loop interchange, and loop unrolling. On this basis, we derive a more efficient dataflow and accelerator architecture through a quantitative analysis of the trade-off between accelerator performance and hardware cost. We then show how to search this design space for the optimal loop optimization strategy and use it to guide the accelerator architecture toward optimal performance. The effectiveness of the proposed architecture is confirmed by implementing VGG-16, ResNet-50, and ResNet-152 on Xilinx ZCU102 and Xilinx ZCU111 FPGAs. The achieved peak throughputs of 721.48 GOPS, 546.98 GOPS, and 664.66 GOPS, respectively, demonstrate outstanding performance, efficient resource usage, and flexibility.
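The loop transformations named above can be sketched in software. The following C fragment is a minimal illustration, not the paper's actual design: all dimensions (M, N, R, C, K) and the tile factors Tm and Tn are hypothetical toy values, and the tiled loop nest only models in sequential code what an FPGA would realize as a Tm x Tn parallel MAC array.

```c
#include <assert.h>
#include <string.h>

/* Toy dimensions chosen only for illustration; real layer sizes differ. */
#define M 4            /* output feature maps (channels) */
#define N 3            /* input feature maps (channels)  */
#define R 6            /* output rows                    */
#define C 6            /* output columns                 */
#define K 3            /* kernel size                    */
#define H (R + K - 1)  /* input rows (stride 1, no pad)  */
#define W (C + K - 1)  /* input columns                  */

#define Tm 2           /* tile/unroll factor over output channels */
#define Tn 3           /* tile/unroll factor over input channels  */

static float in[N][H][W];
static float wt[M][N][K][K];
static float out_ref[M][R][C];
static float out_tiled[M][R][C];

/* Deterministic integer-valued test data (exact in float arithmetic). */
static void init(void) {
    for (int n = 0; n < N; n++)
        for (int h = 0; h < H; h++)
            for (int w = 0; w < W; w++)
                in[n][h][w] = (float)((n + h + w) % 5);
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    wt[m][n][i][j] = (float)((m + n + i + j) % 3 - 1);
}

/* Reference loop nest: MAC operations inside the four convolution loops
   (output channel, row, column, input channel) plus the two kernel loops. */
static void conv_ref(float out[M][R][C]) {
    for (int m = 0; m < M; m++)
        for (int r = 0; r < R; r++)
            for (int c = 0; c < C; c++) {
                float acc = 0.0f;
                for (int n = 0; n < N; n++)
                    for (int i = 0; i < K; i++)
                        for (int j = 0; j < K; j++)
                            acc += wt[m][n][i][j] * in[n][r + i][c + j];
                out[m][r][c] = acc;
            }
}

/* Tiled variant: the channel loops are tiled by Tm/Tn and interchanged
   outward; on an FPGA the two innermost channel loops (one Tm x Tn block)
   would be fully unrolled into a parallel MAC array. */
static void conv_tiled(float out[M][R][C]) {
    memset(out, 0, sizeof(float) * M * R * C);
    for (int nn = 0; nn < N; nn += Tn)            /* input-channel tiles  */
        for (int mm = 0; mm < M; mm += Tm)        /* output-channel tiles */
            for (int r = 0; r < R; r++)
                for (int c = 0; c < C; c++)
                    for (int m = mm; m < mm + Tm && m < M; m++)      /* unrolled  */
                        for (int n = nn; n < nn + Tn && n < N; n++)  /* unrolled  */
                            for (int i = 0; i < K; i++)
                                for (int j = 0; j < K; j++)
                                    out[m][r][c] += wt[m][n][i][j]
                                                  * in[n][r + i][c + j];
}
```

A small driver that calls `init()`, runs both kernels, and compares the two output buffers confirms the tiled/interchanged/unrolled schedule computes the same result as the plain nest; choosing Tm and Tn is exactly the kind of design-space decision the paper's search addresses.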
