The self-attention-based Transformer has achieved great success in many computer vision tasks, yet its application to blind video quality assessment (VQA) remains far from comprehensive. Evaluating the quality of in-the-wild videos is challenging due to the absence of a pristine reference and the presence of shooting distortions. This paper presents a Co-trained Space–Time Attention network for the blind VQA problem, termed CoSTA. Specifically, we first build CoSTA by alternately concatenating divided space–time attention blocks. Then, to facilitate the training of CoSTA, we design a vectorized regression loss that encodes the mean opinion score (MOS) into a probability vector and embeds a special token as the learnable variable of the MOS, leading to a better fit to the human rating process. Finally, to address the data-hungry nature of the Transformer, we propose to co-train the spatial and temporal attention weights using both images and videos. Extensive experiments are conducted on benchmark in-the-wild video datasets, including LIVE-Qualcomm, LIVE-VQC, KoNViD-1k, YouTube-UGC, LSVQ, LSVQ-1080p, and DVL2021. Experimental results demonstrate the superiority of the proposed CoSTA over the state-of-the-art. The source code is publicly available at https://github.com/GZHU-DVL/CoSTA.
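
To illustrate the "divided space–time attention" idea the abstract refers to, the sketch below shows one block that applies temporal attention (each spatial location attends across frames) followed by spatial attention (each frame attends across its patches), in PyTorch. This is a minimal sketch under assumed shapes and names (e.g., `DividedSpaceTimeBlock`, the patch/frame layout), not the authors' implementation; see the linked repository for the actual CoSTA code.

```python
# Minimal sketch of a divided space-time attention block (assumption:
# PyTorch, ViT-style tokens of shape (batch, frames, patches, dim)).
# Names and hyperparameters are illustrative, not from the CoSTA repo.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention followed by spatial attention, then an MLP."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape

        # Temporal attention: fold patches into the batch, attend over frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt_norm = self.temporal_norm(xt)
        xt = xt + self.temporal_attn(xt_norm, xt_norm, xt_norm)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: fold frames into the batch, attend over patches.
        xs = x.reshape(b * t, p, d)
        xs_norm = self.spatial_norm(xs)
        xs = xs + self.spatial_attn(xs_norm, xs_norm, xs_norm)[0]
        x = xs.reshape(b, t, p, d)

        # Token-wise feed-forward with a residual connection.
        return x + self.mlp(x)


if __name__ == "__main__":
    block = DividedSpaceTimeBlock(dim=768, heads=12)
    clip = torch.randn(2, 8, 196, 768)  # 2 clips, 8 frames, 14x14 patches
    print(block(clip).shape)            # torch.Size([2, 8, 196, 768])
```

Because the spatial attention path never mixes information across frames, its weights can in principle also be trained on single images, which is the intuition behind co-training the spatial and temporal attention with both images and videos.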