Abstract

Deep neural network (DNN)-based video analysis has become one of the most essential and challenging tasks for capturing implicit information from video streams. Although DNNs significantly improve analysis quality, they introduce intensive compute and memory demands and require dedicated hardware for efficient processing. Customized heterogeneous systems that combine general-purpose processors (CPUs) with specialized processors (DNN accelerators) are among the most promising solutions. Among various heterogeneous systems, the combination of CPU and FPGA has been intensively studied for DNN inference, offering improved latency and energy consumption compared to CPU + GPU schemes, and increased flexibility and reduced time-to-market cost compared to CPU + ASIC designs. However, deploying DNN-based video analysis on CPU + FPGA systems still presents challenges: tedious RTL programming, intricate design verification, and time-consuming design space exploration. To address these challenges, we present a novel framework, called EcoSys, to explore co-design and optimization opportunities on CPU-FPGA heterogeneous systems for accelerating video analysis. Its novel technologies include: 1) a coherent memory space shared by the host and the customized accelerator, enabling efficient task partitioning and online DNN model refinement with reduced data transfer latency; 2) an end-to-end design flow that supports high-level design abstraction and allows rapid development of customized hardware accelerators from Python-based DNN descriptions; 3) a design space exploration (DSE) engine that determines the design space and explores optimized solutions given the targeted heterogeneous system and user-specified constraints; and 4) a complete set of co-optimization solutions, including a layer-based pipeline, a feature-map partition scheme, and an efficient memory hierarchy for the accelerator, plus multithreaded programming for the CPU.
In this article, we demonstrate our design framework by accelerating the long-term recurrent convolutional network (LRCN), which analyzes the input video and outputs one semantic caption per frame. EcoSys delivers 314.7 and 58.1 frames/s targeting the LRCN model with AlexNet and VGG-16 backbones, respectively. Compared to the multithreaded CPU and pure FPGA designs, EcoSys achieves $20.6\times$ and $5.3\times$ higher throughput.
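To make the feature-map partition scheme mentioned in the abstract concrete, here is a minimal sketch (not the authors' implementation) of splitting a large activation map into tiles that would fit an accelerator's on-chip buffers. The tile sizes and the NumPy-based model are illustrative assumptions.

```python
import numpy as np

def partition_feature_map(fmap, tile_h, tile_w):
    """Yield (row, col, tile) covering a (C, H, W) feature map tile by tile."""
    _, h, w = fmap.shape
    for r0 in range(0, h, tile_h):
        for c0 in range(0, w, tile_w):
            # Each tile is small enough to stage in on-chip buffers.
            yield r0, c0, fmap[:, r0:r0 + tile_h, c0:c0 + tile_w]

# Example: an 8x8 map with 3 channels split into 4x4 tiles -> 4 tiles.
fmap = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
tiles = list(partition_feature_map(fmap, 4, 4))
```

Processing tiles independently also exposes the parallelism that the layer-based pipeline exploits, since tiles of one layer can stream into the next layer's compute engine.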

Highlights

  • Video content analysis is one of the most challenging applications that allow computers to understand the human world, which contains copious incomplete and non-structural information

  • We have seen a rapid development of DNNs for image/video recognition tasks, and among them, the long-term recurrent convolutional network (LRCN) is one of the most prominent solutions for video content analysis, such as performing activity recognition and content captioning for the input videos [1]

  • Assuming a compute engine (CE) with parallelism configured as cpf and kpf, cpf pairs of input activations and DNN weights are passed from the on-chip buffers and handled by the first processing element (PE1) for multiply-accumulate (MAC) operations, and the generated results are produced along the first output channel dimension

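The compute-engine parallelism in the highlight above can be modeled functionally. The sketch below (an assumption, not the authors' RTL) treats `cpf` as the number of input-channel MACs performed per cycle and `kpf` as the number of output channels produced in parallel; the "parallel" lanes are emulated with inner loops.

```python
import numpy as np

def compute_engine(inputs, weights, cpf, kpf):
    """Model of a CE: inputs (C_in,), weights (C_out, C_in) -> (C_out,)."""
    c_in = inputs.shape[0]
    c_out = weights.shape[0]
    out = np.zeros(c_out, dtype=np.float32)
    for k0 in range(0, c_out, kpf):        # kpf output channels per pass
        for c0 in range(0, c_in, cpf):     # cpf MACs per "cycle" per lane
            for k in range(k0, min(k0 + kpf, c_out)):
                # One PE accumulating cpf input/weight pairs.
                out[k] += np.dot(inputs[c0:c0 + cpf], weights[k, c0:c0 + cpf])
    return out
```

Choosing cpf and kpf trades DSP usage against on-chip bandwidth, which is exactly the kind of knob the DSE engine sweeps.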

Summary

INTRODUCTION

Video content analysis is one of the most challenging applications that allow computers to understand the human world, which contains copious incomplete and non-structural information. The LRCN is a powerful tool for video analysis, but it involves more complex network structures and more intensive compute and memory demands during inference than a single CNN or RNN. To efficiently handle such a unique DNN, customized heterogeneous systems are developed that combine CPUs with dedicated DNN accelerators. EcoSys proposes an end-to-end design flow that connects Python-based high-level DNN descriptions with their board-level FPGA implementations. It integrates a comprehensive design space exploration (DSE) to generate suitable task partition schemes and hardware configurations for optimized performance on targeted CPU + FPGA systems. It also creates a coherent memory space for the host CPU and the customized accelerator, eliminating data transfer latency between these devices.
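The LRCN dataflow summarized above, a CNN backbone extracting a feature vector per frame and a recurrent cell consuming the sequence, can be sketched minimally. Everything below is illustrative: shapes are tiny, weights are random, and a plain tanh RNN cell stands in for the LSTM used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HIDDEN = 16, 8

# Fixed random projection standing in for an AlexNet/VGG-16 backbone.
W_CNN = rng.standard_normal((FEAT, 16)) * 0.1

def cnn_backbone(frame):
    """Per-frame feature extraction: flatten + linear projection + tanh."""
    return np.tanh(W_CNN @ frame.ravel())

W_H = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
W_X = rng.standard_normal((HIDDEN, FEAT)) * 0.1

def rnn_step(h, x):
    """One recurrent step (plain tanh cell; the paper uses an LSTM)."""
    return np.tanh(W_H @ h + W_X @ x)

video = [rng.standard_normal((4, 4)) for _ in range(5)]  # five tiny frames
h = np.zeros(HIDDEN)
states = []
for frame in video:
    h = rnn_step(h, cnn_backbone(frame))
    states.append(h)  # one hidden state (caption context) per frame
```

This per-frame CNN-then-RNN structure is what motivates the task partition in EcoSys: the convolutional backbone maps naturally onto the FPGA accelerator, while the lightweight recurrent and captioning stages can run on the CPU threads.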

RELATED WORK
CONTRIBUTIONS
DESIGN CHALLENGES OF ACCELERATING VIDEO ANALYSIS
Diverse DNN layers
Real-life application requirements
Hardware implementation difficulties
THE PROPOSED ECOSYS FRAMEWORK
Architecture overview
Customized accelerator design
Design space definition
Overall DSE flow
Input stage-wise compute demands
Compute resource allocation
Memory resource allocation
Accelerator performance estimation
Multi-threaded optimization
Preparation work
CAPI integration benefits
The baseline designs
The EcoSys proposed designs
Comparison results
Findings
CONCLUSIONS