Abstract

As more and more deep learning tasks are pushed to mobile devices, accelerators for running these networks efficiently gain in importance. We show that an existing class of general-purpose accelerators, modulo-scheduled coarse-grained reconfigurable array (CGRA) processors typically used to accelerate multimedia workloads, can be a viable alternative to dedicated deep neural network processing hardware. To this end, an auto-tuning compiler is presented that maps convolutional neural networks (CNNs) efficiently onto such architectures. The auto-tuner analyzes the structure of the CNN and the features of the CGRA, then explores the large optimization space to generate code that allows for an efficient mapping of the network. Evaluated with various CNNs, the auto-tuned code achieves an 11-fold speedup over the initial mapping. Comparing the energy per inference, the CGRA outperforms other general-purpose accelerators and an ARMv8 processor by a significant margin.
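To make the auto-tuning idea concrete, below is a minimal sketch of a search loop of the kind the abstract describes: it enumerates candidate schedules (tiling and unrolling factors) for one convolutional layer and ranks them with a simple analytical cost model parameterized by the CGRA's resources. All names, parameters, and the cost model itself are illustrative assumptions, not the paper's actual compiler.

```python
import itertools

# Hypothetical CGRA resource description (illustrative, not from the paper).
CGRA = {"rows": 4, "cols": 4, "local_mem_kib": 64}

def candidate_schedules(layer):
    """Enumerate tiling/unrolling factors for one conv layer.

    The search space (tile sizes over output channels and image rows,
    plus an inner-loop unroll factor) is a simplified stand-in for the
    scheduling parameters a real CGRA mapping compiler would expose.
    """
    tiles_oc = [t for t in (1, 2, 4, 8, 16) if t <= layer["out_channels"]]
    tiles_h = [t for t in (1, 2, 4, 8) if t <= layer["height"]]
    unroll = [1, 2, 4]
    return itertools.product(tiles_oc, tiles_h, unroll)

def estimate_cycles(layer, schedule):
    """Toy analytical cost model: compute cycles plus a memory-traffic
    penalty when a tile's working set exceeds the CGRA's local memory."""
    t_oc, t_h, unroll = schedule
    macs = (layer["out_channels"] * layer["height"] * layer["width"]
            * layer["in_channels"] * layer["k"] * layer["k"])
    pes = CGRA["rows"] * CGRA["cols"]  # processing elements in the array
    compute = macs / (pes * unroll)
    tile_bytes = t_oc * t_h * layer["width"] * 4  # fp32 tile footprint
    spill = 2.0 if tile_bytes > CGRA["local_mem_kib"] * 1024 else 1.0
    return compute * spill

def autotune(layer):
    """Pick the schedule the cost model ranks fastest."""
    return min(candidate_schedules(layer),
               key=lambda s: estimate_cycles(layer, s))

# Example: a 3x3 convolution layer from a typical CNN.
conv = {"in_channels": 64, "out_channels": 128,
        "height": 56, "width": 56, "k": 3}
print("best (tile_oc, tile_h, unroll):", autotune(conv))
```

A production auto-tuner would replace the toy cost model with feedback from the actual modulo scheduler or hardware measurements, which is what makes exploring the large optimization space worthwhile.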
