Abstract

Standard convolutional neural networks (CNNs) contain a large amount of data redundancy, and the same accuracy can be obtained with low-bit-width weights instead of floating-point representations. Most CNNs are developed and executed on high-end GPU-based workstations, and it is hard to port existing implementations to portable edge FPGAs because of their limited on-chip block memory and battery capacity. In this paper, we present the adaptive pointwise convolution and 2D convolution joint network (AP2D-Net), an ultra-low-power, relatively high-throughput system that uses dynamic-precision weights and activations. Our system achieves high performance, and we trade off accuracy against power efficiency for unmanned aerial vehicle (UAV) object detection scenarios. We evaluate our system on the Zynq UltraScale+ MPSoC Ultra96 mobile FPGA platform. The target board reaches a real-time speed of 30 fps under 5.6 W, and the FPGA on-chip power is only 0.6 W. The power efficiency of our system is 2.8× better than the best system design on a Jetson TX2 GPU and 1.9× better than the design on a PYNQ-Z1 SoC FPGA.
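
To make the idea of replacing floating-point weights with lower-bit weights concrete, the following is a minimal sketch of symmetric uniform quantization. It is a generic illustration and not the dynamic-precision scheme used in AP2D-Net; the function names `quantize_uniform` and `dequantize`, the sample weights, and the chosen bit widths are all assumptions made for this example.

```python
def quantize_uniform(weights, num_bits):
    """Map floats to signed integer codes using symmetric uniform quantization.

    Illustrative only: a generic scheme, not the paper's dynamic-precision method.
    Returns (codes, scale) such that weight is approximately code * scale.
    """
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2 for a signed representation")
    max_abs = max(abs(w) for w in weights)
    q_max = 2 ** (num_bits - 1) - 1  # largest positive code, e.g. 127 for 8 bits
    if max_abs == 0.0:
        return [0] * len(weights), 1.0
    scale = max_abs / q_max
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point weights from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.31, 0.05, -0.77, 0.40]
for bits in (8, 4, 2):
    codes, scale = quantize_uniform(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} max_abs_error={max_err:.4f}")
```

In this sketch the rounding error of each weight is bounded by half the quantization step, so the error grows as the bit width shrinks. Whether a network keeps its accuracy at a given bit width has to be checked empirically.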

Highlights

  • Convolutional neural network (CNN)-based deep learning (DL) algorithms are widely used in autonomous driving, natural language processing, web recommendation systems, etc., which greatly improve the quality of life of modern society

  • It takes on the order of giga floating-point operations (GFLOP) to process a single image, which is far beyond the computational ability of the central processing unit (CPU) and hard to process in real time

  • We developed an unmanned aerial vehicle (UAV) object detection system for real-time, high-accuracy, low-power applications, designing our CNN with register-transfer level (RTL) intellectual property (IP) such as direct memory access (DMA), AXI4-Stream, and digital signal processors (DSPs)

Summary

Introduction

Convolutional neural network (CNN)-based deep learning (DL) algorithms are widely used in autonomous driving, natural language processing, web recommendation systems, etc., which greatly improve the quality of life of modern society. For more intricate tasks, the number of CNN model parameters grows exponentially. It takes on the order of giga floating-point operations (GFLOP) to process a single image, which is far beyond the computational ability of the central processing unit (CPU) and hard to process in real time. To handle these compute-intensive tasks, researchers leverage the advantages of the graphics processing unit (GPU), such as high bandwidth and thread parallelism. The bandwidth and on-chip memory of the FPGA are limited compared with the modern GPU. Design challenges such as low bandwidth and limited cache size make it hard for an FPGA design to work in real time.
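
As a rough illustration of why a single image can cost on the order of GFLOPs, the sketch below counts multiply-accumulate operations (MACs) for one dense convolution layer and compares a 3×3 kernel with a 1×1 pointwise kernel. The layer shape is hypothetical and not taken from the paper, the helper `conv2d_macs` is introduced only for this example, and one MAC is counted as two floating-point operations.

```python
def conv2d_macs(out_h, out_w, in_ch, out_ch, kernel):
    """Multiply-accumulate count for one dense 2D convolution layer (no bias)."""
    return out_h * out_w * in_ch * out_ch * kernel * kernel


# Hypothetical layer shape, chosen only for illustration (not from the paper).
OUT_H, OUT_W, IN_CH, OUT_CH = 160, 320, 64, 64

macs_3x3 = conv2d_macs(OUT_H, OUT_W, IN_CH, OUT_CH, kernel=3)
macs_1x1 = conv2d_macs(OUT_H, OUT_W, IN_CH, OUT_CH, kernel=1)

for name, macs in (("3x3 conv", macs_3x3), ("1x1 pointwise conv", macs_1x1)):
    flops = 2 * macs  # one multiply plus one add per MAC
    print(f"{name}: {macs / 1e9:.2f} GMAC, {flops / 1e9:.2f} GFLOP")
print(f"3x3 / 1x1 cost ratio: {macs_3x3 // macs_1x1}")
```

Under these assumptions a single 3×3 layer already lands in the GFLOP range, while the pointwise layer with the same channel counts needs one ninth of the operations.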

Related Work
Optimization of the Computational Kernels
Bandwidth Optimization to Improve Throughput
Model Optimization
Binary Neural Networks
Implementation Methodologies
Proposed System Architecture and IP Block Design
AP2D-Net Modeling of the CNN-Based FPGA Accelerator
Structure of AP2D-Net
Feature Extraction
Classification and Regression
AP2D-Net System Design on FPGA
Overall Architecture of the AP2D-Net Accelerator
Optimization on a Heterogeneous System
Dataset
Training
Evaluation Criteria
AP2D-Net Modeling
Trade-Off between Working Frequency and Energy Consumption
Hardware Usage on FPGA
Conclusions
