Recently, many works have applied deep learning techniques to video compression, achieving promising results and advancing the field of deep learning video compression (DLVC). However, the architectures of existing DLVC methods are rigid and inflexible: different networks must be designed for different scenarios, such as delay-constrained and non-delay-constrained coding. Frequently switching between networks reduces throughput on modern deep learning platforms and increases maintenance costs. To address this problem, we propose a unified deep video compression (UVC) framework that can switch freely between application scenarios without changing the network architecture. The UVC framework follows an explicit-compression and implicit-generation perspective and contains two sub-networks: an explicit reference frame compression network (ERFCN) and an implicit reference frame generation network (IRFGN). The ERFCN compresses the current frame with the help of a reference frame. To improve its performance, we introduce a Transformer into this network, which more effectively removes spatial redundancy from the input frame and benefits the subsequent inter-prediction process. We also develop a novel long-range motion estimation module for inter-prediction that generates motion vectors from global motion information between two frames and can therefore handle complex long-range motion. The IRFGN captures the temporal relationship between forward and backward reconstructed frames and synthesizes a high-quality implicit reference frame for the current frame. To achieve this, we design split spatial-temporal attention and multi-scale prediction modules. We conduct extensive experiments on three widely used video compression benchmarks (HEVC, UVG, and MCL-JCV), and the results demonstrate that our approach outperforms related DLVC methods.
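The abstract gives no implementation details, so the following minimal PyTorch-style sketch is illustrative only: the names ERFCN and IRFGN come from the abstract, but their internals, the `code_frame` helper, and all shapes are assumptions, not the authors' code. It shows how a single compression network could serve both scenarios by changing only the reference frame: in delay-constrained mode the previously decoded frame is the explicit reference, while in non-delay-constrained mode an implicit reference is first synthesized from forward and backward reconstructions.

```python
# Hypothetical sketch of a unified coding loop; ERFCN/IRFGN internals are
# placeholders and do not reflect the paper's actual architecture.
import torch
import torch.nn as nn


class ERFCN(nn.Module):
    """Placeholder explicit reference frame compression network:
    compresses the current frame conditioned on a reference frame."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.encode = nn.Conv2d(6, channels, 3, padding=1)   # current + reference
        self.decode = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, current: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        latent = self.encode(torch.cat([current, reference], dim=1))
        return self.decode(latent)                            # reconstructed frame


class IRFGN(nn.Module):
    """Placeholder implicit reference frame generation network:
    synthesizes a reference from forward and backward reconstructions."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(6, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, forward_rec: torch.Tensor, backward_rec: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([forward_rec, backward_rec], dim=1))


def code_frame(erfcn, irfgn, current, prev_rec, next_rec=None):
    """Hypothetical helper: the same ERFCN is reused in both scenarios;
    only the reference frame changes."""
    if next_rec is None:                       # delay-constrained (low-delay) mode
        reference = prev_rec
    else:                                      # non-delay-constrained mode
        reference = irfgn(prev_rec, next_rec)  # implicit reference frame
    return erfcn(current, reference)


if __name__ == "__main__":
    erfcn, irfgn = ERFCN(), IRFGN()
    cur, prev, nxt = (torch.rand(1, 3, 64, 64) for _ in range(3))
    low_delay_rec = code_frame(erfcn, irfgn, cur, prev)           # P-frame style
    random_access_rec = code_frame(erfcn, irfgn, cur, prev, nxt)  # B-frame style
    print(low_delay_rec.shape, random_access_rec.shape)
```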