Abstract
Over the past few years, super-resolution (SR) processing has achieved astonishing progress along with the development of deep learning. Nevertheless, the rigorous requirement for real-time inference, especially for video tasks, poses a harsh challenge for both model architecture design and hardware-level implementation. In this paper, we propose a hardware-aware acceleration on embedded GPU devices as a full-stack SR deployment framework. The most critical stage in the SR flow, which applies dictionary learning, is analyzed in detail and optimized with a tailored dictionary slimming strategy. Moreover, we delve into the programming architecture of the hardware while analyzing the model structure to optimize the computation kernels, reducing inference latency and maximizing throughput under restricted computing power. In addition, we further accelerate the model with 8-bit integer inference by quantizing the weights of the compressed model. An adaptive 8-bit quantization flow for the SR task enables the quantized model to achieve results comparable to the full-precision baselines. With the help of our approaches, the computation and communication bottlenecks in deep dictionary-learning-based SR models can be overcome effectively. Experiments on both the edge embedded device NVIDIA NX and a 2080Ti show that our framework significantly exceeds the performance of the state-of-the-art NVIDIA TensorRT and can achieve real-time performance.
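The adaptive 8-bit quantization flow itself is not detailed in this abstract; as a point of reference, the sketch below illustrates only the generic symmetric int8 weight-quantization step that such a flow typically builds on. The function name, the per-channel option, and the random weight tensor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_weights_int8(w, per_channel_axis=None):
    """Symmetric int8 quantization of a weight tensor (illustrative sketch).

    When `per_channel_axis` is given, one scale is computed per slice along
    that axis (e.g. per output channel of a conv layer); otherwise a single
    per-tensor scale is used.
    """
    if per_channel_axis is None:
        max_abs = np.max(np.abs(w))
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
    else:
        # Reduce over every axis except the chosen channel axis.
        reduce_axes = tuple(a for a in range(w.ndim) if a != per_channel_axis)
        max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
        scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale  # dequantize with q.astype(np.float32) * scale

# Example: quantize a conv weight [out_ch, in_ch, kH, kW] per output channel.
w = np.random.randn(64, 64, 3, 3).astype(np.float32)
q, scale = quantize_weights_int8(w, per_channel_axis=0)
```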