Abstract

Depthwise separable convolutions (DSC) are widely deployed in lightweight convolutional neural networks because of their high efficiency. However, the speedup that Graphics Processing Units (GPUs) achieve for DSC in practice falls short of what theory predicts. In this paper, we propose several approaches for accelerating DSC on a Field-Programmable Gate Array (FPGA). For the preceding layers, S2C (spatial to channel) is proposed to accelerate computation and improve the utilization of computational resources and bandwidth. An efficient SharePE is proposed to accelerate the DSC and improve the efficiency of the computing resources. A regulable parallelism approach is proposed to compute the different pointwise convolutional layers efficiently. The P2D&D2P approach is proposed to reduce external memory access. For the accelerator system as a whole, a pre-load workflow is proposed to reduce the time the accelerator waits between two images. We demonstrate our approaches on SkyNet using the Ultra96V2 development board. The results indicate that our accelerator achieves 80.030 frames per second and 0.072 Joule per image for UAV object detection, which is the state of the art for SkyNet. In addition, the MobileNetV2 model was implemented on a larger XC7Z100 FPGA, where our accelerator classifies each ImageNet picture in 2.69 ms. Code is available at https://github.com/AILearnerLi/DAC-SDC-2020-SEUer.
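
The theoretical efficiency of DSC mentioned above can be illustrated by counting multiply-accumulate operations (MACs). The sketch below uses the standard textbook cost model, not anything taken from the accelerator itself. It assumes stride 1 and an output feature map of the same spatial size as the input, and the layer dimensions in the example are hypothetical.

```python
# Minimal sketch: MAC counts for a standard convolution versus a
# depthwise separable convolution (depthwise k x k + pointwise 1 x 1).
# Assumptions (not from the paper): stride 1, output spatial size h x w,
# and purely illustrative layer dimensions.

def standard_conv_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for a k x k standard convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


def dsc_cost_ratio(c_out: int, k: int) -> float:
    """Closed form of DSC cost / standard cost: 1/c_out + 1/k^2."""
    return 1.0 / c_out + 1.0 / (k * k)


if __name__ == "__main__":
    # Hypothetical layer: 20 x 20 feature map, 96 -> 192 channels, 3 x 3 kernel.
    h, w, c_in, c_out, k = 20, 20, 96, 192, 3
    std = standard_conv_macs(h, w, c_in, c_out, k)
    dsc = depthwise_separable_macs(h, w, c_in, c_out, k)
    print(f"standard: {std} MACs")        # 66355200
    print(f"DSC:      {dsc} MACs")        # 7718400
    print(f"ratio:    {dsc / std:.4f}")   # 0.1163
```

For a 3 x 3 kernel the ratio approaches 1/9 as the number of output channels grows, so DSC needs roughly an order of magnitude fewer MACs in theory. Whether hardware realizes that saving depends on how well it keeps its compute units and memory bandwidth busy, which is the gap the abstract describes.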

