Abstract

Deep neural network (DNN)-based video analysis has become one of the most essential and challenging tasks for capturing implicit information from video streams. Although DNNs significantly improve analysis quality, they introduce intensive compute and memory demands and require dedicated hardware for efficient processing. Customized heterogeneous systems that combine general-purpose processors (CPUs) with specialized processors (DNN accelerators) are among the most promising solutions. Among various heterogeneous systems, the combination of CPU and FPGA has been intensively studied for DNN inference, offering improved latency and energy consumption compared to CPU + GPU schemes, and increased flexibility and reduced time-to-market cost compared to CPU + ASIC designs. However, deploying DNN-based video analysis on CPU + FPGA systems still presents challenges: tedious RTL programming, intricate design verification, and time-consuming design space exploration. To address these challenges, we present a novel framework, called EcoSys, to explore co-design and optimization opportunities on CPU-FPGA heterogeneous systems for accelerating video analysis. Its novel technologies include: 1) a coherent memory space shared by the host and the customized accelerator, enabling efficient task partitioning and online DNN model refinement with reduced data transfer latency; 2) an end-to-end design flow that supports high-level design abstraction and allows rapid development of customized hardware accelerators from Python-based DNN descriptions; 3) a design space exploration (DSE) engine that determines the design space and explores optimized solutions given the targeted heterogeneous system and user-specified constraints; and 4) a complete set of co-optimization solutions, including a layer-based pipeline, a feature-map partition scheme, and an efficient memory hierarchy for the accelerator, plus multithreaded programming for the CPU.
In this article, we demonstrate our design framework by accelerating the long-term recurrent convolutional network (LRCN), which analyzes the input video and outputs one semantic caption per frame. EcoSys delivers 314.7 and 58.1 frames/s targeting the LRCN model with AlexNet and VGG-16 backbones, respectively. Compared to the multithreaded CPU and pure FPGA designs, EcoSys achieves $20.6\times$ and $5.3\times$ higher throughput.
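To make the feature-map partition scheme mentioned in the abstract concrete, here is a minimal sketch (not the authors' implementation) of splitting a large activation map into tiles that would fit an accelerator's on-chip buffers. The tile sizes and the NumPy-based model are illustrative assumptions.

```python
import numpy as np

def partition_feature_map(fmap, tile_h, tile_w):
    """Yield (row, col, tile) covering a (C, H, W) feature map tile by tile."""
    _, h, w = fmap.shape
    for r0 in range(0, h, tile_h):
        for c0 in range(0, w, tile_w):
            # Each tile is small enough to stage in on-chip buffers.
            yield r0, c0, fmap[:, r0:r0 + tile_h, c0:c0 + tile_w]

# Example: an 8x8 map with 3 channels split into 4x4 tiles -> 4 tiles.
fmap = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
tiles = list(partition_feature_map(fmap, 4, 4))
```

Processing tiles independently also exposes the parallelism that the layer-based pipeline exploits, since tiles of one layer can stream into the next layer's compute engine.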

Highlights

  • Video content analysis is one of the most challenging applications that allow computers to understand the human world, which contains copious incomplete and non-structural information

  • We have seen a rapid development of DNNs for image/video recognition tasks, and among them, the long-term recurrent convolutional network (LRCN) is one of the most prominent solutions for video content analysis, such as performing activity recognition and content captioning for the input videos [1]

  • Assuming a compute engine (CE) with parallelism configured as cpf and kpf, cpf pairs of input activations and DNN weights are passed from the on-chip buffers and handled by the first processing element (PE1) for multiply-accumulate (MAC) operations, and the generated results are produced along the first output channel dimension

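The compute-engine parallelism in the highlight above can be modeled functionally. The sketch below (an assumption, not the authors' RTL) treats `cpf` as the number of input-channel MACs performed per cycle and `kpf` as the number of output channels produced in parallel; the "parallel" lanes are emulated with inner loops.

```python
import numpy as np

def compute_engine(inputs, weights, cpf, kpf):
    """Model of a CE: inputs (C_in,), weights (C_out, C_in) -> (C_out,)."""
    c_in = inputs.shape[0]
    c_out = weights.shape[0]
    out = np.zeros(c_out, dtype=np.float32)
    for k0 in range(0, c_out, kpf):        # kpf output channels per pass
        for c0 in range(0, c_in, cpf):     # cpf MACs per "cycle" per lane
            for k in range(k0, min(k0 + kpf, c_out)):
                # One PE accumulating cpf input/weight pairs.
                out[k] += np.dot(inputs[c0:c0 + cpf], weights[k, c0:c0 + cpf])
    return out
```

Choosing cpf and kpf trades DSP usage against on-chip bandwidth, which is exactly the kind of knob the DSE engine sweeps.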

Summary

INTRODUCTION

Video content analysis is one of the most challenging applications that allow computers to understand the human world, which contains copious incomplete and non-structural information. The LRCN is a powerful tool for video analysis, but it involves more complex network structures and more intensive compute and memory demands during inference than a single CNN or RNN. To efficiently handle such a unique DNN, customized heterogeneous systems are developed that combine CPUs with dedicated DNN accelerators. EcoSys proposes an end-to-end design flow that connects Python-based high-level DNN descriptions with their board-level FPGA implementations. It integrates a comprehensive design space exploration (DSE) to generate suitable task partition schemes and hardware configurations for optimized performance on targeted CPU + FPGA systems. It also creates a coherent memory space for the host CPU and the customized accelerator, eliminating data transfer latency between these devices.
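The LRCN dataflow summarized above, a CNN backbone extracting a feature vector per frame and a recurrent cell consuming the sequence, can be sketched minimally. Everything below is illustrative: shapes are tiny, weights are random, and a plain tanh RNN cell stands in for the LSTM used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HIDDEN = 16, 8

# Fixed random projection standing in for an AlexNet/VGG-16 backbone.
W_CNN = rng.standard_normal((FEAT, 16)) * 0.1

def cnn_backbone(frame):
    """Per-frame feature extraction: flatten + linear projection + tanh."""
    return np.tanh(W_CNN @ frame.ravel())

W_H = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
W_X = rng.standard_normal((HIDDEN, FEAT)) * 0.1

def rnn_step(h, x):
    """One recurrent step (plain tanh cell; the paper uses an LSTM)."""
    return np.tanh(W_H @ h + W_X @ x)

video = [rng.standard_normal((4, 4)) for _ in range(5)]  # five tiny frames
h = np.zeros(HIDDEN)
states = []
for frame in video:
    h = rnn_step(h, cnn_backbone(frame))
    states.append(h)  # one hidden state (caption context) per frame
```

This per-frame CNN-then-RNN structure is what motivates the task partition in EcoSys: the convolutional backbone maps naturally onto the FPGA accelerator, while the lightweight recurrent and captioning stages can run on the CPU threads.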

RELATED WORK
CONTRIBUTIONS
DESIGN CHALLENGES OF ACCELERATING VIDEO ANALYSIS
Diverse DNN layers
Real-life application requirements
Hardware implementation difficulties
THE PROPOSED ECOSYS FRAMEWORK
Architecture overview
Customized accelerator design
Design space definition
Overall DSE flow
Input stage-wise compute demands
Compute resource allocation
Memory resource allocation
Accelerator performance estimation
Multi-threaded optimization
Preparation work
CAPI integration benefits
The baseline designs
The EcoSys proposed designs
Comparison results
Findings
CONCLUSIONS