Abstract

The ability of convolutional neural networks (CNNs) to mimic the behavioral characteristics of biological visual neurons has made them a popular choice for image recognition. A CNN comprises a deep, wide network structure that performs convolutional computations. Most of its operations lie in the CONV layers, which are therefore compute-intensive. The demand for greater computing power for CNNs is a key driver behind robust parallel computing platforms such as GPUs and specialized hardware accelerators; an efficient CNN implementation that improves performance on limited resources without reducing accuracy thus remains a challenge. A configurable, template-based single convolution layer, comprising convolution, ReLU, and pooling sub-layers, is designed and reused to realize all CONV layers of the CNN. It contains a processing-element array with twenty-five processing elements for parallel computation and data reuse. Pipelining, loop unrolling, and array partitioning are the techniques applied to speed up CONV-layer computation. The design is verified with MNIST handwritten-digit image classification on a low-cost, low-memory 32-bit PYNQ-Z2 system-on-chip edge device. The computation time of the proposed hardware design is 86.14% lower than an Intel Core3 CPU, 82.24% lower than a Haswell Core2 CPU, 76.48% lower than an NVIDIA Tesla K80 GPU, and 29.26% lower than a conventional accelerator. The implementation results show no accuracy drop alongside the increase in computation speed.
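The structure described above (a reusable convolution + ReLU + pooling layer accelerated with pipelining, loop unrolling, and array partitioning) can be sketched in HLS-style C++. This is a minimal illustration, not the paper's implementation: the tile sizes, kernel size, and pragma placement below are assumptions, and the HLS pragmas (which a standard C++ compiler simply ignores) merely indicate where such directives would typically be applied.

```cpp
#include <algorithm>

// Hypothetical tile sizes for illustration; the paper's actual
// dimensions and PE-array mapping are not given in the abstract.
const int IN = 6;              // input tile width/height
const int K = 3;               // kernel width/height
const int OUT = IN - K + 1;    // valid-convolution output size (4)
const int P = OUT / 2;         // size after 2x2 max pooling (2)

// One template CONV layer: convolution -> ReLU -> 2x2 max pooling.
void conv_layer(const float in[IN][IN], const float w[K][K],
                float pooled[P][P]) {
    // Partition the weight array into registers so all K*K
    // multiply-accumulates can run in parallel.
#pragma HLS ARRAY_PARTITION variable=w complete dim=0
    float conv[OUT][OUT];
    for (int r = 0; r < OUT; ++r) {
        for (int c = 0; c < OUT; ++c) {
            // Pipeline the output-pixel loop: one result per cycle.
#pragma HLS PIPELINE II=1
            float acc = 0.0f;
            for (int i = 0; i < K; ++i) {
                // Fully unroll the kernel loops so the K*K MACs
                // map onto parallel processing elements.
#pragma HLS UNROLL
                for (int j = 0; j < K; ++j) {
#pragma HLS UNROLL
                    acc += in[r + i][c + j] * w[i][j];
                }
            }
            conv[r][c] = std::max(acc, 0.0f);  // ReLU sub-layer
        }
    }
    // 2x2 max-pooling sub-layer.
    for (int r = 0; r < P; ++r)
        for (int c = 0; c < P; ++c)
            pooled[r][c] = std::max(
                std::max(conv[2 * r][2 * c], conv[2 * r][2 * c + 1]),
                std::max(conv[2 * r + 1][2 * c], conv[2 * r + 1][2 * c + 1]));
}
```

In an actual HLS flow, unrolling the two kernel loops instantiates the multiply-accumulate units in parallel (analogous to the paper's array of twenty-five processing elements for a 5x5 kernel), while array partitioning removes the memory-port bottleneck that would otherwise serialize them.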
