Abstract
Due to the growing demand for high-quality video services in 4G and 5G applications, quantitatively measuring the quality of video services is becoming a vital task. Existing no-reference video quality assessment (NR-VQA) methods regress computationally complex statistical transforms or convolutional neural network (CNN) features to predict a quality score. In this paper, we propose a novel NR-VQA scheme that systematically samples spatiotemporal planes (XY, XT, and YT) based on the high standard deviation (σ) of their high-frequency bands to represent distortion. The human visual system (HVS) is highly sensitive to structural information in visual scenes, and distortions disrupt these structural properties. The proposed scheme encodes two-level, three-dimensional structural video information using novel Local Spatiotemporal Tetra Patterns (LSTP) on the highest-σ plane sampled from each block of planes. In addition, we extract quality-aware deep features from the second-highest-σ video frame (XY, spatial) sampled from each block using a fine-tuned CNN model. The extracted LSTP and deep quality-aware features of the two highest-σ frames are average pooled and concatenated with the top hundred σ values of the other frames to form the final video-level features. Finally, the concatenated features are fed to a support vector regressor (SVR) to predict the perceptual quality scores of test videos. The proposed method is evaluated on ten publicly available standard VQA databases containing synthetic, authentic, and mixed distortions. Extensive experiments indicate that the proposed model outperforms the state-of-the-art VQA models and is consistent with human subjective assessment.
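The σ-based plane sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which transform produces the high-frequency band, so a one-level Haar diagonal-detail (HH) band is assumed here, and the function names (`haar_hh`, `plane_sigmas`, `sample_top_planes`), the `(T, H, W)` array layout, and the block size are all hypothetical choices made for illustration.

```python
import numpy as np


def haar_hh(plane):
    """One-level Haar diagonal-detail (HH) band of a 2-D array.

    Assumed stand-in for the 'high-frequency band'; the abstract does not
    name the transform actually used.
    """
    # Crop to even dimensions so the plane splits into 2x2 cells.
    h, w = plane.shape[0] // 2 * 2, plane.shape[1] // 2 * 2
    p = plane[:h, :w].astype(np.float64)
    a, b = p[0::2, 0::2], p[0::2, 1::2]
    c, d = p[1::2, 0::2], p[1::2, 1::2]
    return (a - b - c + d) / 2.0


def plane_sigmas(video, axis):
    """Standard deviation of the HH band for every plane along `axis`.

    `video` is assumed to have shape (T, H, W):
    axis=0 yields XY planes, axis=1 yields XT planes, axis=2 yields YT planes.
    """
    planes = np.moveaxis(video, axis, 0)
    return np.array([haar_hh(p).std() for p in planes])


def sample_top_planes(sigmas, block_size, rank=0):
    """Pick, from each consecutive block, the plane with the (rank+1)-th highest σ.

    rank=0 selects the highest-σ plane, rank=1 the second highest.
    """
    picks = []
    for start in range(0, len(sigmas), block_size):
        block = sigmas[start:start + block_size]
        order = np.argsort(block)[::-1]  # indices sorted by descending σ
        if rank < len(order):
            picks.append(start + int(order[rank]))
    return picks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(16, 32, 32))  # synthetic (T, H, W) clip
    xy_sigmas = plane_sigmas(video, axis=0)
    print(sample_top_planes(xy_sigmas, block_size=4, rank=0))
    print(sample_top_planes(xy_sigmas, block_size=4, rank=1))
```

Under these assumptions, `rank=0` would select the planes passed to the structural (LSTP) descriptor and `rank=1` the XY frames passed to the CNN; the remaining σ values could be sorted and truncated to form the additional feature vector mentioned in the abstract.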