Abstract

Convolutional neural networks (CNNs) have achieved significant accuracy improvements in many intelligent applications, at the cost of intensive convolution operations and massive data movement. To deploy CNNs efficiently on low-power embedded platforms in real time, the depthwise separable convolution has been proposed as a replacement for the standard convolution, especially in lightweight CNNs, markedly reducing computational complexity and model size. However, it is difficult for a general convolution engine to realize the theoretical performance improvement, because the reduced data dependency of depthwise convolution significantly limits the opportunity for data reuse. To address this issue, a flexible, high-performance FPGA-based accelerator is proposed to efficiently process the inference of both large-scale and lightweight CNNs. First, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize data utilization and minimize logic overhead. Second, these two layers can be processed either directly after standard convolutions, eliminating external memory accesses, or independently, for greater flexibility. Third, a performance model is proposed to automatically explore the optimal design options of the accelerator. The proposed hardware accelerator is evaluated on an Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.
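
The scale of the complexity reduction can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper; it uses the common MobileNet-style symbols (Dk = kernel size, M = input channels, N = output channels, Df = output feature-map size) to compare multiply-accumulate counts for a standard convolution and its depthwise separable replacement.

```python
# Illustrative sketch (not from the paper): MAC counts for a standard
# convolution versus a depthwise separable convolution of the same shape.

def standard_conv_macs(Dk: int, M: int, N: int, Df: int) -> int:
    # Dk x Dk filters applied across all M input channels for each of N outputs.
    return Dk * Dk * M * N * Df * Df

def depthwise_separable_macs(Dk: int, M: int, N: int, Df: int) -> int:
    depthwise = Dk * Dk * M * Df * Df   # one Dk x Dk filter per input channel
    pointwise = M * N * Df * Df         # 1x1 convolution mixes channels
    return depthwise + pointwise

if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 14x14 output map.
    std = standard_conv_macs(3, 256, 256, 14)
    sep = depthwise_separable_macs(3, 256, 256, 14)
    print(f"standard: {std:,} MACs, separable: {sep:,} MACs, "
          f"reduction: {std / sep:.1f}x")  # roughly 8-9x for a 3x3 kernel
```

For a 3x3 kernel the reduction factor is approximately 1 / (1/N + 1/Dk^2), i.e. close to 9x for wide layers, which is why the depthwise and pointwise stages, rather than raw arithmetic, become the bottleneck that the accelerator's shared dataflow targets.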
