Abstract
In recent years, Deep Neural Networks (DNNs) have achieved state-of-the-art results in fields such as Computer Vision, Natural Language Processing, and Speech Recognition. Of all the DNN architectures, Convolutional Neural Networks (CNNs) have been the most effective in tasks such as image classification and object detection. The high performance of CNNs comes at the cost of computational complexity. Currently, Graphics Processing Units (GPUs) are used to accelerate CNN training and inference on workstations and data servers. Though popular, GPUs are not suitable for embedded applications because they are not energy efficient. ASIC and FPGA accelerators have the potential to run CNNs in a way that is optimized for both energy and performance. In this paper, we present an architecture that takes a novel approach to computing convolution results, using row-wise inputs as opposed to traditional tile-based processing. When implemented on an inexpensive PYNQ Z1 board running at 100 MHz, our architecture exceeds the results of state-of-the-art architectures. The total latency to run the convolution layers of the VGG16 benchmark is nearly 1.5x lower for our architecture than for state-of-the-art architectures.
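To make the contrast between row-wise and tile-based input handling concrete, the following is a minimal software sketch of one common way to consume an image row by row: a line buffer that holds the most recent K rows and emits one output row as soon as K rows are available. This is an illustrative assumption only. The abstract does not describe the actual dataflow of the proposed architecture, and the function names `conv2d_direct` and `conv2d_row_stream` are hypothetical. The sketch assumes a square K x K kernel, stride 1, and no padding.

```python
from collections import deque


def conv2d_direct(image, kernel):
    """Reference result: valid (no padding, stride 1) 2D cross-correlation."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv2d_row_stream(rows, kernel):
    """Consume input one row at a time (hypothetical row-wise scheme).

    Only the k most recent rows are kept, so the full image is never
    stored. One output row is yielded for each input row once the
    buffer is full.
    """
    k = len(kernel)
    window = deque(maxlen=k)  # line buffer: the k most recent input rows
    for row in rows:
        window.append(row)
        if len(window) < k:
            continue  # not enough rows buffered yet
        out_w = len(row) - k + 1
        yield [
            sum(window[i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

# Feed the rows through an iterator to mimic a streaming input.
streamed = list(conv2d_row_stream(iter(image), kernel))
print(streamed == conv2d_direct(image, kernel))
```

In this sketch each input row is read exactly once, whereas a tile-based scheme typically re-reads the overlapping border rows and columns shared by neighboring tiles. Whether the proposed hardware obtains its latency advantage in this way is not stated in the abstract.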