Abstract

Video quality assessment (VQA) methods focus on particular degradation types, usually artificially induced on a small set of reference videos. Hence, most traditional VQA methods under-perform in-the-wild. Deep learning approaches have had limited success due to the small size and diversity of existing VQA datasets, whether artificially or authentically distorted. We introduce a new in-the-wild VQA dataset that is substantially larger and more diverse: KonVid-150k. It consists of a coarsely annotated set of 153,841 videos with five quality ratings each, and 1,596 videos with a minimum of 89 ratings each. Additionally, we propose new, efficient VQA approaches (MLSP-VQA) relying on multi-level spatially pooled deep features (MLSP). They are exceptionally well suited for training at scale, compared to deep transfer learning approaches. Our best method, MLSP-VQA-FF, improves the Spearman rank-order correlation coefficient (SRCC) on the commonly used KoNViD-1k in-the-wild benchmark dataset to 0.82. It surpasses the best existing deep-learning model (0.80 SRCC) and the best hand-crafted feature-based method (0.78 SRCC). We further investigate how alternative approaches perform under different levels of label noise and different dataset sizes, showing that MLSP-VQA-FF is the overall best method for videos in-the-wild. Finally, we show that the MLSP-VQA models trained on KonVid-150k set the new state of the art for cross-test performance on KoNViD-1k and LIVE-Qualcomm with 0.83 and 0.64 SRCC, respectively. For KoNViD-1k, this inter-dataset testing even outperforms intra-dataset experiments, showing excellent generalization.
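The "multi-level spatially pooled" features and the feed-forward head behind MLSP-VQA-FF can be pictured roughly as follows. This is a minimal sketch, not the paper's implementation: the backbone (ResNet-50 here, rather than the Inception-style network used in the MLSP literature), the hooked layers, the frame sampling, and the head's layer sizes are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class MLSPFeatureExtractor(nn.Module):
    """Sketch of MLSP-style features: globally average-pool activations from
    several depths of a frozen, pretrained CNN and concatenate them.
    Backbone and layer choices are illustrative, not the paper's exact setup."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.pool = nn.AdaptiveAvgPool2d(1)  # spatial pooling at each level
        self._acts = []
        for layer in (self.backbone.layer1, self.backbone.layer2,
                      self.backbone.layer3, self.backbone.layer4):
            layer.register_forward_hook(
                lambda m, i, o: self._acts.append(self.pool(o).flatten(1)))

    @torch.no_grad()
    def forward(self, frames):                    # frames: (N, 3, H, W)
        self._acts.clear()
        self.backbone(frames)                     # hooks collect per-level features
        per_frame = torch.cat(self._acts, dim=1)  # (N, 256+512+1024+2048)
        return per_frame.mean(dim=0)              # pool over the sampled frames

class FFHead(nn.Module):
    """Small feed-forward regressor in the spirit of MLSP-VQA-FF;
    hidden size and dropout are assumed values."""
    def __init__(self, in_dim=3840, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Dropout(0.25), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)            # predicted quality score (MOS)
```

The design point this illustrates is why such models train quickly at scale: the frozen backbone features are extracted once per video, and only the small feed-forward head is optimized, in contrast to end-to-end deep transfer learning.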

Highlights

  • Videos have become a central medium for business marketing [1], with over 81% of businesses using video as a marketing tool

  • We believe that all the additional measures we have taken to refine our dataset have significantly improved its ecological validity and will improve the performance of video quality assessment (VQA) methods trained on it in the future

  • Our learning approach (MLSP-VQA) outperforms the best existing VQA methods trained end-to-end on several datasets, and is substantially faster to train without sacrificing any predictive power


Summary

INTRODUCTION

Videos have become a central medium for business marketing [1], with over 81% of businesses using video as a marketing tool. State-of-the-art NR-VQA algorithms perform worse on in-the-wild videos than on synthetically distorted ones. These methods aggregate quality characteristics of individual video frames that are engineered for specific purposes, such as detecting particular compression artifacts. Our new dataset, KonVid-150k, presents a unique opportunity to analyze the trade-off between the number of training videos and the annotation noise/precision, in terms of performance on the KonVid-150k-B benchmark dataset. At the same time, this new dataset exacerbates two problems of classical NR-VQA methods. In a short ablation study, we investigate the impact of architectural and hyperparameter choices for both models. Both approaches are evaluated on existing VQA datasets consisting of authentic videos as well as on datasets containing artificially degraded videos; on in-the-wild videos, the proposed methods outperform classical methods based on hand-crafted features. Our best method reaches 0.82 SRCC on KoNViD-1k versus the best existing 0.80 SRCC, and shows excellent generalization in inter-dataset tests when trained on KonVid-150k, surpassing even the intra-dataset tests with 0.83 SRCC.
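As a brief illustration of the evaluation metric quoted above, SRCC can be computed with SciPy; the scores below are invented for demonstration only and are not results from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# SRCC compares only the *ranking* of predicted scores against the mean
# opinion scores (MOS), so it is insensitive to monotonic miscalibration.
mos = np.array([2.1, 3.4, 4.0, 1.5, 3.9, 2.8])   # subjective MOS per video (made up)
pred = np.array([2.4, 3.1, 4.2, 1.7, 3.6, 2.9])  # model predictions (made up)

srcc, p_value = spearmanr(pred, mos)
print(f"SRCC = {srcc:.2f}")  # 1.00 here: the predicted ranking matches exactly
```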

RELATED WORK
ANNOTATION QUALITY
VIDEO QUALITY PREDICTION
MODEL EVALUATION
Findings
CONCLUSION