Abstract

Due to their huge computational and memory requirements, implementing energy-efficient and high-performance Convolutional Neural Networks (CNNs) on embedded systems still represents a major challenge for hardware designers. This paper presents the complete design of a heterogeneous embedded system, realized using a Field-Programmable Gate Array (FPGA) System-on-Chip (SoC), suitable for accelerating the inference of CNNs in power-constrained environments, such as those typical of IoT applications. The proposed architecture is validated by running large-scale CNNs on low-cost devices. The prototype realized on a Zynq XC7Z045 device achieves a power efficiency of up to 135 Gops/W. When the VGG-16 model is inferred, a frame rate of up to 11.8 fps is reached.

Highlights

  • Nowadays, Convolutional Neural Networks (CNNs) are exceptionally popular owing to their ability to exceed human accuracy in a wide range of applications, from recognition tasks [1] such as face detection [2], object classification [1], text understanding [3] and speech recognition [4], to self-driving electric cars [5] and Internet of Things (IoT) devices [6].

  • An efficient heterogeneous SoC design was proposed to accelerate the inference of reduced-precision CNNs.

  • The novel architecture exploits an effective hardware/software partitioning, in which the computation-intensive convolutional (CONV) layers are performed by a specialized hardware architecture, whereas the memory-intensive fully connected (FC) layers are executed by a software routine running on the processor (see the sketch after this list).
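
The partitioning can be pictured as a simple dispatch loop running on the processor. The following C sketch is illustrative only: conv_accel_run(), fc_forward() and the layer descriptor are hypothetical names standing in for the paper's actual driver and data structures, which this summary does not describe.

    /* Minimal sketch of the CONV-in-hardware / FC-in-software partitioning.
     * conv_accel_run() is a hypothetical stand-in for the real accelerator
     * driver; names and signatures are assumptions, not the paper's API. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef enum { LAYER_CONV, LAYER_FC } layer_kind_t;

    typedef struct {
        layer_kind_t kind;
        const int8_t *weights;   /* reduced-precision (int8) parameters */
        size_t in_len, out_len;
    } layer_t;

    /* Stand-in for the FPGA-side CONV engine: on the real SoC this would
     * program the accelerator (e.g., via memory-mapped registers) and wait
     * for completion rather than compute on the CPU. */
    static void conv_accel_run(const layer_t *l, const int8_t *in, int8_t *out)
    {
        (void)in;
        for (size_t i = 0; i < l->out_len; ++i)
            out[i] = 0; /* stub result */
    }

    /* Memory-intensive FC layer kept in software: a plain int8
     * matrix-vector product, one accumulator per output neuron. */
    static void fc_forward(const layer_t *l, const int8_t *in, int8_t *out)
    {
        for (size_t r = 0; r < l->out_len; ++r) {
            int32_t acc = 0;
            for (size_t c = 0; c < l->in_len; ++c)
                acc += (int32_t)l->weights[r * l->in_len + c] * (int32_t)in[c];
            out[r] = (int8_t)(acc >> 8); /* naive requantization for the sketch */
        }
    }

    /* Dispatch loop: CONV layers go to the fabric, FC layers stay on the CPU. */
    static void run_network(const layer_t *layers, size_t n,
                            const int8_t *input, int8_t *buf_a, int8_t *buf_b)
    {
        const int8_t *src = input;
        for (size_t i = 0; i < n; ++i) {
            int8_t *dst = (i % 2 == 0) ? buf_a : buf_b;
            if (layers[i].kind == LAYER_CONV)
                conv_accel_run(&layers[i], src, dst);
            else
                fc_forward(&layers[i], src, dst);
            src = dst; /* ping-pong buffering between layers */
        }
    }

    int main(void)
    {
        static const int8_t w[4 * 8] = { 1, 2, 3 }; /* toy FC weights */
        layer_t net[] = {
            { LAYER_CONV, NULL, 8, 8 },
            { LAYER_FC,   w,    8, 4 },
        };
        int8_t in[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        int8_t a[8], b[8];
        run_network(net, 2, in, a, b);
        printf("first FC output: %d\n", b[0]);
        return 0;
    }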

Summary

Introduction

Convolutional Neural Networks (CNNs) are exceptionally popular owing to their ability to exceed human accuracy in a wide range of applications, from recognition tasks [1] such as face detection [2], object classification [1], text understanding [3] and speech recognition [4], to self-driving electric cars [5] and Internet of Things (IoT) devices [6]. FPGA Systems-on-Chip (SoCs) are often preferred when dealing with deep CNNs, since they offer a good balance among performance, cost and power efficiency. Both Xilinx [21,22] and Intel [23,24] SoC-FPGAs merge the flexibility of software routines running on a general-purpose processor with the advantages of special-purpose parallel hardware architectures. This paper presents a power-efficient heterogeneous embedded system purposely designed for the real-time inference of large-scale CNNs. The proposed architecture can be implemented on virtually any FPGA-based SoC, enabling competitive speed and energy efficiency even on low-end devices. The novel embedded system exhibits end-to-end frame rates of 2.65 and 11.8 fps, which significantly outperform state-of-the-art implementations based on the same embedded platforms and well suit pervasive low-cost IoT applications.
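
The reduced-precision data mentioned in the highlights is typically obtained by fixed-point quantization of weights and activations. Below is a minimal sketch of one common scheme, symmetric int8 with a per-layer power-of-two scale; this exact number format is an assumption for illustration, as the summary does not detail the accelerator's arithmetic.

    /* Hedged sketch of symmetric int8 fixed-point quantization; the
     * per-layer power-of-two scale (frac_bits) is an illustrative
     * assumption, not necessarily the format used by the accelerator. */
    #include <stdint.h>
    #include <math.h>
    #include <stdio.h>

    /* Map a float value onto int8 with frac_bits fractional bits, saturating. */
    static int8_t quantize(float x, int frac_bits)
    {
        float scaled = x * (float)(1 << frac_bits);
        if (scaled >  127.0f) scaled =  127.0f;
        if (scaled < -128.0f) scaled = -128.0f;
        return (int8_t)lrintf(scaled);
    }

    /* Inverse mapping back to float. */
    static float dequantize(int8_t q, int frac_bits)
    {
        return (float)q / (float)(1 << frac_bits);
    }

    int main(void)
    {
        int8_t q = quantize(0.75f, 6); /* 0.75 * 2^6 = 48 */
        printf("q = %d, back = %f\n", q, dequantize(q, 6));
        return 0;
    }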

Background and Motivations
The Proposed SIMD CNN Accelerator
Architecture of the SIMD Buffer
Design of the SIMD CE
Implementation of the Fully Connected Layers
Implementation of the Proposed CNN Accelerator on Heterogeneous FPGAs
Findings
Conclusions