Abstract

Field Programmable Gate Arrays (FPGAs) have become efficient accelerators for convolutional neural network (CNN) inference due to their high performance and flexibility. To further improve the performance of CNN inference on FPGAs, Xilinx released an Intellectual Property core (IP core) called the Deep Learning Processor Unit (DPU). Unlike previous FPGA-based hardware designs that focus on specific functions or CNNs, the DPU IP supports a rich set of basic deep learning functions, so developers can conveniently use DPUs to accelerate CNN inference. In a DPU-based CNN acceleration platform, an encapsulated scheduler plays a crucial role in task scheduling between the heterogeneous ARM cores and multiple DPUs. However, the current scheduler is unsatisfactory because of its low scheduling efficiency. This paper therefore presents a high-performance task assignment framework built upon Xilinx hybrid CPU-FPGA MPSoC devices. We first evaluate the main causes of the low scheduling efficiency. Then, we explore the scheduler's rules and improve scheduling efficiency through purposeful observations and analysis. Finally, we integrate our optimizations and propose an efficient task assignment framework that maximizes performance on the DPU-based CNN acceleration platform. Experimental results on the Xilinx Zynq UltraScale+ MPSoC ZCU104 show that our framework significantly boosts scheduling efficiency for small-scale CNNs (from 36% to 70%), medium-scale CNNs (from 65% to 95%), and large-scale CNNs (from 77% to 99%) compared with the original scheduling strategy.
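The central idea, a host-side scheduler assigning CNN inference tasks to multiple DPU instances, can be illustrated with a minimal sketch. Note that `DpuWorker`, `schedule`, and the task representation here are hypothetical placeholders for illustration only; they do not reproduce the paper's framework or the Xilinx runtime API, which manages DPU cores through its own runner objects.

```python
import queue
import threading

class DpuWorker(threading.Thread):
    """Hypothetical worker thread standing in for one DPU instance."""

    def __init__(self, worker_id, task_queue, results):
        super().__init__()
        self.worker_id = worker_id
        self.task_queue = task_queue
        self.results = results

    def run(self):
        while True:
            task = self.task_queue.get()
            if task is None:  # sentinel: no more tasks for this worker
                self.task_queue.task_done()
                break
            # A real implementation would invoke the DPU runner here;
            # we simply record which worker handled which task.
            self.results.append((self.worker_id, task))
            self.task_queue.task_done()

def schedule(tasks, num_dpus=2):
    """Dispatch inference tasks to num_dpus workers via a shared queue."""
    task_queue = queue.Queue()
    results = []
    workers = [DpuWorker(i, task_queue, results) for i in range(num_dpus)]
    for w in workers:
        w.start()
    for t in tasks:
        task_queue.put(t)
    for _ in workers:
        task_queue.put(None)  # one stop sentinel per worker
    for w in workers:
        w.join()
    return results
```

A shared-queue design like this keeps every DPU worker busy as long as tasks remain, which is the kind of utilization gap (idle accelerators waiting on the host scheduler) that the paper's framework targets.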

Highlights

  • Considering that this study focuses on improving scheduling efficiency to accelerate convolutional neural network (CNN) inference on a Deep Learning Processor Unit (DPU)-based platform, the related work mainly reviews research on optimizing CNN inference on Field Programmable Gate Arrays (FPGAs)

  • This study aims to present a high-performance task assignment framework built upon Xilinx hybrid CPU-FPGA SoC devices with the DPU IP

  • We first evaluate the main causes of the low scheduling efficiency problem


Introduction

A. BACKGROUND Convolutional neural networks (CNNs) have gradually replaced traditional machine vision methods in image recognition, object detection, image segmentation, and many other machine vision applications because of their excellent performance [1]–[6]. A CNN consists of layers with different functions, which require suitable hardware to accelerate the inference process. Many emerging fields, such as intelligent robots, unmanned aerial vehicles, autonomous cars, and space probes, impose strict restrictions on the power, latency, and physical size of hardware accelerators, and traditional GPUs can hardly satisfy these requirements [7], [8]. To meet these strict requirements, Field Programmable Gate Arrays (FPGAs) have become high-performance and flexible accelerators for CNN inference in many emerging fields [9]–[13].

