Abstract

The primary function of multimedia systems is to seamlessly transform and display content to users while maintaining the perception of acceptable quality. For images and videos, perceptual quality assessment algorithms play an important role in determining what is acceptable quality and what is unacceptable from a human visual perspective. As modern image quality assessment (IQA) algorithms gain widespread adoption, it is important to achieve a balance between their computational efficiency and their quality prediction accuracy. One way to improve computational performance to meet real-time constraints is to use simplistic models of visual perception, but such an approach has a serious drawback in terms of poor-quality predictions and limited robustness to changing distortions and viewing conditions. In this paper, we investigate the advantages and potential bottlenecks of implementing a best-in-class IQA algorithm, Most Apparent Distortion, on graphics processing units (GPUs). Our results suggest that an understanding of the GPU and CPU architectures, combined with detailed knowledge of the IQA algorithm, can lead to non-trivial speedups without compromising prediction accuracy. A single-GPU and a multi-GPU implementation showed a 24× and a 33× speedup, respectively, over the baseline CPU implementation. A bottleneck analysis revealed the kernels with the highest runtimes, and a microarchitectural analysis illustrated the underlying reasons for the high runtimes of these kernels. Programs written with optimizations such as blocking that map well to CPU memory hierarchies do not map well to the GPU’s memory hierarchy. While compute unified device architecture (CUDA) is convenient to use and is powerful in facilitating general purpose GPU (GPGPU) programming, knowledge of how a program interacts with the underlying hardware is essential for understanding performance bottlenecks and resolving them.

Highlights

  • Images and videos undergo several transformations from capture to display in various formats. The key to transferring this content over networks lies in the design of image/video analysis and processing algorithms that can simultaneously tackle two opposing goals: (1) handling potentially massive content sizes (e.g., 8K video), while (2) producing results in a timely fashion on practical computing hardware

  • In this paper, we address the issue of graphics processing unit (GPU) acceleration of the perceptual and statistical processing stages used in quality assessment (QA), in the hope of informing the design, implementation, and deployment of future multimedia QA systems

  • Obtaining performance gains through general purpose GPU (GPGPU) solutions is an attractive area of research

Summary

Introduction

Images and videos undergo several transformations from capture to display in various formats. The key novelties compared to our prior work are as follows: (1) we implemented all of MAD in CUDA, thereby allowing analysis of every stage (detection, appearance, memory transfers); (2) we tested three different GPUs to examine the effects of different GPU architectures; (3) we analyzed the differences between using a single GPU and parallelizing MAD across three GPUs (multi-GPU implementation) to investigate how the results scale with the number of GPUs; and (4) we performed a microarchitectural analysis of the key bottleneck kernel (the CUDA kernel that computes the various local statistics) to gain insight into GPU memory infrastructure usage and to inform future implementations.
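To make the phrase "local statistics" concrete, the following is a minimal CPU-side sketch in plain Python, not the authors' CUDA kernel. It assumes the statistics are the mean, standard deviation, skewness, and kurtosis of overlapping square blocks; the summary above does not list them, and the function names, block size, and stride are illustrative choices only.

```python
import math

def block_stats(image, top, left, size):
    """Mean, standard deviation, skewness, and kurtosis of one size x size block."""
    values = [image[r][c]
              for r in range(top, top + size)
              for c in range(left, left + size)]
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4 (population form).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    if std == 0.0:
        # Flat block: skewness and kurtosis are undefined, so report zeros.
        return mean, 0.0, 0.0, 0.0
    return mean, std, m3 / std ** 3, m4 / std ** 4

def local_statistics(image, size=4, stride=2):
    """Slide a block across the image and collect statistics per position.

    Each (top, left) position is independent of the others, so the loop
    nest could be mapped to one GPU thread (or thread group) per block.
    Size and stride here are hypothetical, chosen only for illustration.
    """
    rows, cols = len(image), len(image[0])
    return {
        (top, left): block_stats(image, top, left, size)
        for top in range(0, rows - size + 1, stride)
        for left in range(0, cols - size + 1, stride)
    }
```

Because the stride is smaller than the block size, neighbouring blocks re-read many of the same pixels. In a sketch like this, that overlap is where most of the memory traffic comes from, which is one plausible reason such a kernel would be sensitive to how the GPU memory hierarchy is used.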

Related Work
Acceleration of QA Algorithms
GPU-based Acceleration on Other Image-Processing-Related Techniques
Description of the MAD Algorithm
Visual Detection Stage
Visual Appearance Stage
Overall MAD Score
CPU Tasks
Visual Detection Stage in CUDA
Visual Appearance Stage in CUDA
Results and Analysis
Evaluation 1
Per-Kernel Performance on the Detection Stage
Per-Kernel Performance on the Appearance Stage
Evaluation 2
Evaluation 3
Evaluation 4
Memory Statistics—Global
Memory Statistics—Local
Memory Statistics—Caches
Conclusions