Abstract
Standard convolutional neural networks (CNNs) contain a large amount of redundancy, and the same accuracy can often be obtained with low-bit weights instead of floating-point representations. Most CNNs are developed and executed on high-end GPU-based workstations, and porting existing implementations to portable edge FPGAs is hard because of limited on-chip block memory and battery capacity. In this paper, we present the adaptive pointwise convolution and 2D convolution joint network (AP2D-Net), an ultra-low-power, relatively high-throughput system that combines dynamic-precision weights and activations. Our system achieves high performance while trading off accuracy against power efficiency for unmanned aerial vehicle (UAV) object detection scenarios. We evaluate the system on the Zynq UltraScale+ MPSoC Ultra96 mobile FPGA platform. The target board achieves a real-time speed of 30 fps under 5.6 W, with an FPGA on-chip power of only 0.6 W. The power efficiency of our system is 2.8× better than the best system design on a Jetson TX2 GPU and 1.9× better than the design on a PYNQ-Z1 SoC FPGA.
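To make the low-bit-weight claim concrete, the sketch below shows generic symmetric uniform quantization of a float weight tensor to a fixed bit width. This is only an illustration of the general idea, not the paper's dynamic-precision scheme; the function names and the single per-tensor scale are assumptions for the example.

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Symmetric uniform quantization of a float weight tensor to `bits` bits.

    Illustrative only -- not the AP2D-Net dynamic-precision scheme.
    Uses a single scale for the whole tensor.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit signed
    scale = float(np.max(np.abs(w))) / qmax    # map the largest magnitude to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the quantized one."""
    return q.astype(np.float32) * scale

# Toy weight tensor: quantize, then measure the worst-case reconstruction error.
np.random.seed(0)
w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # a small conv weight shape
q, s = quantize_weights(w, bits=8)
err = float(np.max(np.abs(w - dequantize(q, s))))
```

Because rounding moves each value by at most half a quantization step, the worst-case error is bounded by `scale / 2`, which is why moderate bit widths often preserve accuracy while storing each weight in a fraction of the 32 bits a float would need.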
Highlights
Convolutional neural network (CNN)-based deep learning (DL) algorithms are widely used in autonomous driving, natural language processing, web recommendation systems, and other applications that greatly improve quality of life in modern society
Processing a single image takes on the order of giga floating-point operations (GFLOPs), which is far beyond the computational ability of a central processing unit (CPU) and hard to achieve in real time
We developed a real-time, high-accuracy, low-power unmanned aerial vehicle (UAV) object detection system, building our CNN with register-transfer level (RTL) intellectual property (IP) blocks such as direct memory access (DMA), AXI4-Stream, and digital signal processors (DSPs)
Summary
Convolutional neural network (CNN)-based deep learning (DL) algorithms are widely used in autonomous driving, natural language processing, web recommendation systems, and other applications that greatly improve quality of life in modern society. For more intricate tasks, the number of CNN model parameters grows exponentially: processing a single image takes on the order of giga floating-point operations (GFLOPs), which is far beyond the computational ability of a central processing unit (CPU) and hard to achieve in real time. To handle these compute-intensive tasks, researchers leverage the advantages of the graphics processing unit (GPU), such as high bandwidth and thread parallelism. The bandwidth and on-chip memory of an FPGA, however, are limited compared with a modern GPU, and design challenges such as low bandwidth and limited cache size make real-time operation hard.
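The GFLOP-per-image claim can be checked with a back-of-the-envelope count. The sketch below uses the standard cost formula for a dense 2D convolution layer; the VGG-16-style layer shapes are only an illustrative assumption, not a network from this paper.

```python
def conv_flops(out_h, out_w, c_in, c_out, k):
    """FLOPs (counting multiply and add separately) for one k x k convolution
    layer producing an out_h x out_w x c_out feature map from c_in channels."""
    return 2 * out_h * out_w * c_in * c_out * k * k

# Back-of-the-envelope count for the first two 3x3 conv layers of a
# VGG-16-style network on a 224x224 RGB image (standard VGG-16 shapes,
# used here purely as an example).
flops = conv_flops(224, 224, 3, 64, 3) + conv_flops(224, 224, 64, 64, 3)
# Already several GFLOPs for just two layers, before the deeper stages.
```

Two early layers alone land in the low billions of operations, so a full network easily reaches the giga-FLOP scale per image, which is why CPU-only inference struggles to run in real time.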