Abstract

Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
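
As a rough illustration of this cooperative model, the sketch below shows how a layer-by-layer dispatcher might split work between the Convolution-Specific Processor and the ARM cores. It is a minimal sketch under assumed names: layer_t, csp_run_conv and arm_run_layer are illustrative, not the actual NeuDNN interface.

```c
/* Illustrative sketch of the cooperative execution model described above:
 * convolutional layers are offloaded to the Convolution-Specific Processor (CSP)
 * in the programmable logic, while hard-to-accelerate layers run on the ARM
 * cores (optionally NEON-vectorized). All names are assumptions for exposition. */
#include <stddef.h>

typedef enum { LAYER_CONV, LAYER_FC, LAYER_POOL, LAYER_OTHER } layer_kind_t;

typedef struct {
    layer_kind_t kind;
    const float *weights;
    /* layer geometry omitted for brevity */
} layer_t;

/* Assumed offload call: blocks until the CSP finishes the convolution. */
void csp_run_conv(const layer_t *l, const float *in, float *out);
/* Assumed ARM/NEON fallback for non-convolutional layers. */
void arm_run_layer(const layer_t *l, const float *in, float *out);

void run_network(const layer_t *layers, size_t n_layers,
                 const float *input, float *scratch_a, float *scratch_b)
{
    const float *src = input;
    float *dst = scratch_a;

    for (size_t i = 0; i < n_layers; ++i) {
        if (layers[i].kind == LAYER_CONV)
            csp_run_conv(&layers[i], src, dst);   /* bulk of the workload on the PL */
        else
            arm_run_layer(&layers[i], src, dst);  /* hard-to-accelerate parts on the PS */

        /* ping-pong buffers between layers */
        src = dst;
        dst = (dst == scratch_a) ? scratch_b : scratch_a;
    }
}
```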

Highlights

  • In the last few years, Deep Convolutional Neural Networks have become the go-to solution for most tasks that require human-level understanding of data

  • As an integration to the second use case, we present an experiment related to the acceleration of a lightweight CNN topology, to provide insight into the possibility of using NEURAghe to accelerate recent algorithms conceived for extensive workload reduction

  • The accelerator implemented in the programmable logic is controllable via software: it integrates a microcontroller in charge of finely managing the basic operations of the other building blocks (an illustrative sketch of this control path follows this list)
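
One way to picture this fine-grained software control is the host filling a small job descriptor that the on-accelerator soft core interprets. The layout below is purely an assumption for illustration; the field names and the csp_submit_job call are not the actual NEURAghe interface.

```c
/* Illustrative only: a possible shape for job descriptors exchanged between the
 * ARM host and the soft core inside the Convolution-Specific Processor.
 * Field names and the submission mechanism are assumptions for exposition. */
#include <stdint.h>

typedef struct {
    uint32_t op;            /* requested operation, e.g. convolution / pooling / ReLU */
    uint32_t in_addr;       /* physical address of the input tile in shared memory */
    uint32_t weight_addr;   /* physical address of the kernel weights */
    uint32_t out_addr;      /* where the engine writes the output tile */
    uint16_t in_h, in_w;    /* input tile geometry */
    uint16_t in_ch, out_ch; /* channel counts */
    uint8_t  kernel_size;   /* e.g. 3 for 3x3 kernels */
    uint8_t  stride;
    uint8_t  relu;          /* apply ReLU on the output path */
    uint8_t  pool;          /* apply pooling on the output path */
} csp_job_t;

/* Assumed host-side call: hand the descriptor to the soft core, which then
 * sequences the line buffers, SoP modules and pooling/ReLU stages. */
int csp_submit_job(const csp_job_t *job);
```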

Summary

INTRODUCTION

In the last few years, Deep Convolutional Neural Networks have become the go-to solution for most tasks that require human-level understanding of data. Several dedicated accelerators have been proposed in the embedded domain, both from companies such as Movidius [26] and from the research community [4, 5, 9]. These architectures are typically implemented as systolic arrays of processing elements or as more specialized engines focused on the acceleration of convolution-accumulation loops, and they outperform all programmable solutions (including FPGAs) in both performance and energy efficiency thanks to their highly optimized implementation. NEURAghe, in contrast, allows implementing any kind of CNN model while fully exploiting the hardware and software capabilities of the Z-7045 SoC; moreover, it eases porting, with significant performance benefits, to next-generation Ultrascale+ SoCs. These SoCs feature a bigger and faster FPGA on the programmable logic (PL), which would allow hosting two convolution engines running at 200 MHz, and a more powerful processing system (PS) based on a quad-core ARM Cortex-A53 processor.
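
For reference, the convolution-accumulation loops that such engines specialize in follow the generic structure sketched below. This is plain, unoptimized C for illustration only, with assumed buffer layouts and names; it is not the NEURAghe convolution engine itself.

```c
/* Generic convolution-accumulation loop nest that CNN accelerators specialize in.
 * Unoptimized, valid-padding, stride-1 C for illustration; layouts are assumptions.
 * Input:  in[in_ch][in_h][in_w], weights: w[out_ch][in_ch][k][k],
 * output: out[out_ch][in_h-k+1][in_w-k+1], all flattened row-major. */
void conv2d(const float *in, int in_ch, int in_h, int in_w,
            const float *w,  int k,     /* k x k kernels */
            float *out,      int out_ch)
{
    int out_h = in_h - k + 1, out_w = in_w - k + 1;

    for (int oc = 0; oc < out_ch; ++oc)
        for (int y = 0; y < out_h; ++y)
            for (int x = 0; x < out_w; ++x) {
                float acc = 0.0f;
                /* multiply-accumulate over all input channels and kernel taps:
                 * this inner body is what sum-of-products datapaths unroll */
                for (int ic = 0; ic < in_ch; ++ic)
                    for (int ky = 0; ky < k; ++ky)
                        for (int kx = 0; kx < k; ++kx)
                            acc += in[(ic * in_h + y + ky) * in_w + (x + kx)]
                                 * w[((oc * in_ch + ic) * k + ky) * k + kx];
                out[(oc * out_h + y) * out_w + x] = acc;
            }
}
```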

RELATED WORK
Target computational model
System architecture
Convolution-Specific Processor
Convolution Engine
Line buffers
SoP modules
Pooling and ReLU module
NEUDNN
NeuDNN front-end
NeuDNN Back-End
EXPERIMENTAL RESULTS
Hardware implementation evaluation
VGG-16
ResNet-18
GPP-accelerated layers performance analysis
Comparison with State of The Art
CONCLUSION