The diversity and complexity of degradations in low-quality videos pose non-trivial challenges for video enhancement, which aims to reconstruct their high-quality counterparts. Prevailing sliding-window-based methods perform poorly because of their limited window size, whereas recurrent networks exploit long-term modeling to aggregate more information, yielding significant performance improvements. However, most recurrent models are trained on simply degraded data and can only tackle specific degradations. To overcome this limitation, we propose a progressive alignment network, the Cross-scale Hierarchical Spatio-Temporal Transformer (CHSTT), which leverages cross-scale tokenization to generate multi-scale visual tokens across the entire sequence and thereby capture extensive long-range temporal dependencies. To enhance spatial and temporal interactions, we introduce a hierarchical Transformer that computes mutual multi-head attention across both the spatial and temporal dimensions. Quantitative and qualitative assessments substantiate the superior performance of CHSTT over several state-of-the-art methods on three distinct video enhancement tasks: video super-resolution, video denoising, and video deblurring.
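The abstract names two mechanisms, cross-scale tokenization and hierarchical spatio-temporal attention, without implementation details. Below is a minimal PyTorch sketch of one plausible reading: multi-scale patch embeddings whose tokens are concatenated per frame, followed by a block that alternates spatial attention (over tokens within a frame) and temporal attention (over the same token index across frames). All module names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossScaleTokenizer(nn.Module):
    """Hypothetical cross-scale tokenizer: patch embeddings at several
    strides produce multi-scale tokens, concatenated along the token axis."""
    def __init__(self, in_ch=3, dim=64, patch_sizes=(4, 8)):
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Conv2d(in_ch, dim, kernel_size=p, stride=p) for p in patch_sizes]
        )

    def forward(self, frames):                     # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b * t, c, h, w)
        tokens = []
        for embed in self.embeds:
            z = embed(x)                           # (B*T, dim, h_p, w_p)
            tokens.append(z.flatten(2).transpose(1, 2))  # (B*T, N_p, dim)
        z = torch.cat(tokens, dim=1)               # tokens from all scales
        return z.reshape(b, t, -1, z.size(-1))     # (B, T, N, dim)

class SpatioTemporalBlock(nn.Module):
    """One plausible reading of mutual spatio-temporal attention:
    spatial attention within each frame, then temporal attention
    across frames for each token position."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z):                          # z: (B, T, N, D)
        b, t, n, d = z.shape
        # Spatial pass: attend over the N tokens of each frame.
        s = self.norm1(z).reshape(b * t, n, d)
        z = z + self.spatial(s, s, s, need_weights=False)[0].reshape(b, t, n, d)
        # Temporal pass: attend over the T frames for each token index.
        u = self.norm2(z).permute(0, 2, 1, 3).reshape(b * n, t, d)
        v = self.temporal(u, u, u, need_weights=False)[0]
        z = z + v.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return z

# Toy usage: a 5-frame 64x64 clip yields 256 + 64 = 320 tokens per frame.
frames = torch.randn(1, 5, 3, 64, 64)
tokens = CrossScaleTokenizer()(frames)             # (1, 5, 320, 64)
out = SpatioTemporalBlock()(tokens)
print(out.shape)                                   # torch.Size([1, 5, 320, 64])
```

Factorizing attention into per-frame spatial and per-token temporal passes keeps the cost manageable over long sequences, which is one way a hierarchical design could capture the long-range temporal dependencies the abstract claims.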