Abstract

ShorcutFusion [1] is an end-to-end framework that effectively maps many well-known deep neural networks (DNNs), such as MobileNet-v2, EfficientNet-B0, ResNet-50, and YOLO-v3, to a generic CNN accelerator on FPGA. Nevertheless, its processing elements are not fully utilized when supporting various networks, leading to relatively low hardware utilization (e.g., 68.42% for YOLO-v3). This study aimed to enhance the performance of ShortcutFusion and introduce ShortcutFusion++ by proposing two simple but effective techniques for eliminating unnecessary stalls in conventional design. First, the prefetching scheme was re-designed to avoid bubble cycles when feeding data to the PE array. Second, the output buffer was reconstructed to pipeline the operations of PEs and the process of writing output feature maps to off-chip memory. The experimental results show that ShortcutFusion++ achieves a PE utilization of 80.95% for the well-known object detection network YOLO-v3, outperforming its baseline by 12.53%.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call