Infrared video colorization can significantly improve perceptual quality by predicting plausible colors and restoring vivid details, especially in harsh environments. However, the task has received little attention and no specialized method exists, and directly applying current grayscale colorization methods tends to produce structurally blurred and temporally inconsistent frames. In this paper, we design an infrared video colorization network, CPNet, that aims to generate visually plausible and spatio-temporally consistent colorized videos. To achieve this, a feature fusion module and a hierarchical colorizer are designed to learn the importance of each consecutive frame and the local and global correlations of the integrated features, respectively. In addition, to consolidate temporal consistency at a fine-grained level, we further introduce a composite loss function that narrows the distance between high-level feature representations while retaining pixel-wise correspondence. Moreover, a new metric named Mean Temporal Variation Similarity (MTVS) is proposed to effectively evaluate the degree of video continuity. Comprehensive experiments on the KAIST dataset demonstrate the superiority of CPNet in producing more authentic colorized videos than state-of-the-art colorization methods. Quantitatively, CPNet achieves improvements of at least 0.89 dB in PSNR and 0.016 in SSIM, along with a significant gain in MTVS. In addition, experiments on the DAVIS dataset confirm the applicability of CPNet to the grayscale video colorization task.
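The abstract describes a composite loss that combines a high-level feature distance with a pixel-wise term. The paper's exact formulation is not given here, so the sketch below is only a minimal illustration of that general idea: a weighted sum of a pixel-wise L1 loss and a feature-space MSE. The `feature_extractor`, the weight `lam`, and the toy network in the usage example are all assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositeLoss(nn.Module):
    """Hypothetical composite loss: pixel-wise L1 plus a feature-space distance.

    This is an illustrative stand-in for the loss described in the abstract;
    the extractor and weighting are assumed, not taken from the paper.
    """

    def __init__(self, feature_extractor, lam=0.1):
        super().__init__()
        self.features = feature_extractor
        self.lam = lam

    def forward(self, pred, target):
        # Pixel-wise term retains per-pixel correspondence with the ground truth.
        pixel_term = F.l1_loss(pred, target)
        # Feature term narrows the distance between high-level representations.
        feat_term = F.mse_loss(self.features(pred), self.features(target))
        return pixel_term + self.lam * feat_term


# Usage with a toy feature extractor (a stand-in for, e.g., a pretrained backbone).
extractor = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
criterion = CompositeLoss(extractor, lam=0.1)
pred = torch.rand(2, 3, 64, 64)    # colorized frames
target = torch.rand(2, 3, 64, 64)  # ground-truth color frames
print(criterion(pred, target).item())
```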