High dynamic range (HDR) video represents a wider range of brightness, detail and colour than standard dynamic range (SDR) video. However, video quality assessment (VQA) models designed for SDR content struggle to capture HDR-specific distortions. In addition, some existing methods designed for HDR video emphasise distortions in local areas of a video frame while ignoring distortions of the frame as a whole. We therefore propose HDR-DRVQA, a no-reference VQA model based on luminance decomposition and recombination that delivers strong performance on HDR video. Specifically, HDR-DRVQA utilises a luminance decomposition strategy to partition video frames into regions spanning the high dynamic range, so that perceptual features can be extracted explicitly from each region. We further propose a residual aggregation module that recombines the multi-region features to extract static spatial distortion representations and dynamic motion cues (captured by feature differences). Taking advantage of the Transformer's strength in long-range dependency modelling, we feed this information into a Transformer network, which interactively learns the motion cues and adaptively constructs a stream of spatial distortion information from shallow to deep layers during temporal aggregation. Experiments on publicly available HDR databases validate that our model significantly outperforms SDR-oriented VQA methods and existing HDR VQA methods.
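The abstract does not specify how the luminance decomposition is implemented, so the following is only a minimal illustrative sketch of the general idea: splitting a frame into luminance regions and pooling simple per-region statistics. The function names (`decompose_by_luminance`, `region_features`), the threshold values, and the use of Rec. 2020 luma weights are assumptions for illustration, not the paper's method.

```python
import numpy as np

def decompose_by_luminance(frame, low=0.2, high=0.8):
    """Split a frame into dark / mid-tone / bright regions by relative luminance.

    `frame` is an H x W x 3 array of RGB values normalised to [0, 1].
    The thresholds `low` and `high` are illustrative, not the paper's values.
    Returns a dict of boolean masks, one per luminance region.
    """
    # Rec. 2020-style luma weights, used here as a stand-in luminance measure.
    luma = 0.2627 * frame[..., 0] + 0.6780 * frame[..., 1] + 0.0593 * frame[..., 2]
    return {
        "dark": luma < low,
        "mid": (luma >= low) & (luma < high),
        "bright": luma >= high,
    }

def region_features(frame, masks):
    """Pool simple per-region statistics (mean, std) as placeholder perceptual features."""
    feats = {}
    for name, mask in masks.items():
        pixels = frame[mask]
        feats[name] = (float(pixels.mean()), float(pixels.std())) if pixels.size else (0.0, 0.0)
    return feats

# Example on a synthetic frame; a real pipeline would feed region features
# into learned spatial and temporal (e.g. Transformer-based) modules.
frame = np.random.rand(1080, 1920, 3).astype(np.float32)
masks = decompose_by_luminance(frame)
print(region_features(frame, masks))
```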