Abstract

Convolutional Neural Networks (CNNs) deliver unmatched performance in image classification, object detection, and object tracking. Since many modern embedded systems for portable devices handle similar tasks, they often deploy CNN-based algorithms. The intensive computational workload of CNN inference demands powerful computing platforms such as Graphics Processing Units (GPUs). Deploying CNNs on mobile devices, however, calls for low-power, application-specific computing platforms such as Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs), which can serve as computation accelerator units. Moreover, algorithmic optimizations such as replacing standard convolution with Depthwise Separable Convolution significantly reduce the computational burden of CNN inference. This paper discusses a pipelined architecture of Depthwise Separable Convolution followed by activation and pooling operations for a single CNN layer. The architecture is implemented on a Xilinx 7-series FPGA and operates at a clock period of 40 ns. It can serve as a building block for an integrated CNN accelerator system on FPGAs of different sizes. This work focuses on speeding up the convolution process rather than implementing a large integrated CNN accelerator design, whose scale makes it difficult to examine the performance of the individual subsystems. To the best of the authors' knowledge, earlier works have implemented integrated CNN accelerator systems, but the blueprint for the architecture of a single CNN layer has not been discussed individually, even though it can be of great help to beginners in understanding FPGA-based computing accelerators for CNNs.
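
To make the cost-reduction claim concrete, the following is a brief sketch of the standard multiply-accumulate (MAC) count comparison, written in the common MobileNet-style notation (the symbols below are an assumed convention; the abstract itself does not define them). For a $D_F \times D_F$ input feature map with $M$ input channels, $N$ output channels, and $D_K \times D_K$ kernels, standard convolution costs

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$

MACs, while a depthwise separable convolution (a per-channel depthwise pass followed by a $1 \times 1$ pointwise pass) costs

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \;+\; M \cdot N \cdot D_F \cdot D_F,$$

a reduction by a factor of

$$\frac{1}{N} + \frac{1}{D_K^{2}},$$

which for a typical $3 \times 3$ kernel approaches a roughly 8- to 9-fold saving when $N$ is large.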
