ShortcutFusion++: Optimizing an End-to-End CNN Accelerator for High PE Utilization

Chunmyung Park, Xuan Truong Nguyen, Jicheon Kim, Hyuk-Jae Lee, Eunjae Hyun

doi:10.5573/ieiespc.2022.11.6.474

Abstract

ShortcutFusion [1] is an end-to-end framework that effectively maps many well-known deep neural networks (DNNs), such as MobileNet-v2, EfficientNet-B0, ResNet-50, and YOLO-v3, to a generic CNN accelerator on FPGA. Nevertheless, its processing elements (PEs) are not fully utilized when supporting various networks, leading to relatively low hardware utilization (e.g., 68.42% for YOLO-v3). This study enhances the performance of ShortcutFusion and introduces ShortcutFusion++, which applies two simple but effective techniques to eliminate unnecessary stalls in the conventional design. First, the prefetching scheme is redesigned to avoid bubble cycles when feeding data to the PE array. Second, the output buffer is restructured so that PE computation is pipelined with the writing of output feature maps to off-chip memory. The experimental results show that ShortcutFusion++ achieves a PE utilization of 80.95% for the well-known object detection network YOLO-v3, outperforming its baseline by 12.53 percentage points.
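The effect of the two techniques on PE utilization can be illustrated with a toy cycle model. The sketch below is not the authors' design: it assumes a layer is processed as a sequence of tiles, each with a hypothetical load, compute, and write-back cost in cycles, and compares a fully serialized schedule with an idealized three-stage pipeline in which the load of the next tile and the write-back of the previous tile overlap with the computation of the current one. The function names and the cycle counts are made up for illustration and do not reproduce the figures reported in the abstract.

```python
from typing import Sequence


def serial_utilization(load: Sequence[int], compute: Sequence[int],
                       store: Sequence[int]) -> float:
    """PE utilization when load, compute, and write-back never overlap."""
    total = sum(l + c + s for l, c, s in zip(load, compute, store))
    return sum(compute) / total


def pipelined_utilization(load: Sequence[int], compute: Sequence[int],
                          store: Sequence[int]) -> float:
    """PE utilization for an idealized three-stage pipeline.

    In step k the model loads tile k, computes tile k-1, and writes back
    tile k-2. Each step lasts as long as its slowest active stage.
    """
    n = len(compute)
    total = 0
    for step in range(n + 2):
        active = []
        if step < n:
            active.append(load[step])          # prefetch the next tile
        if 0 <= step - 1 < n:
            active.append(compute[step - 1])   # PEs work on the current tile
        if 0 <= step - 2 < n:
            active.append(store[step - 2])     # drain the previous tile
        total += max(active)
    return sum(compute) / total


# Hypothetical per-tile costs in cycles (illustrative values only).
load_cycles = [2, 2, 2, 2]
compute_cycles = [10, 10, 10, 10]
store_cycles = [3, 3, 3, 3]

serial = serial_utilization(load_cycles, compute_cycles, store_cycles)
pipelined = pipelined_utilization(load_cycles, compute_cycles, store_cycles)
print(f"serial:    {serial:.2%}")
print(f"pipelined: {pipelined:.2%}")
```

In this model the PEs are idle only while the pipeline fills and drains, so utilization approaches 100% as the number of tiles grows, provided that compute remains the slowest stage. Real hardware has additional stall sources that this sketch ignores.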
