An End-to-End Learning Framework for Video Compression.
Traditional video compression approaches build upon the hybrid coding framework with motion-compensated prediction and residual transform coding. In this paper, we propose the first end-to-end deep video compression framework to take advantage of both the classical compression architecture and the powerful non-linear representation ability of neural networks. Our framework employs pixel-wise motion information, which is learned from an optical flow network and further compressed by an auto-encoder network to save bits. The other compression components are also implemented by the well-designed networks for high efficiency. All the modules are jointly optimized by using the rate-distortion trade-off and can collaborate with each other. More importantly, the proposed deep video compression framework is very flexible and can be easily extended by using lightweight or advanced networks for higher speed or better efficiency. We also propose to introduce the adaptive quantization layer to reduce the number of parameters for variable bitrate coding. Comprehensive experimental results demonstrate the effectiveness of the proposed framework on the benchmark datasets.
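The joint rate-distortion optimization mentioned above can be illustrated with a minimal sketch. It assumes the common Lagrangian form λ·D + R, with mean squared error as the distortion and the rate estimated as the negative base-2 logarithm of the probabilities an entropy model assigns to the coded symbols. The function names and the exact weighting are illustrative assumptions, not taken from the paper.

```python
import math

def mse(original, reconstructed):
    """Mean squared error between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)

def estimated_bits(symbol_probs):
    """Ideal code length in bits for symbols with the given model probabilities."""
    return sum(-math.log2(p) for p in symbol_probs)

def rd_loss(original, reconstructed, symbol_probs, lam):
    """Rate-distortion objective: lam * distortion + bits per pixel (assumed form)."""
    distortion = mse(original, reconstructed)
    rate_bpp = estimated_bits(symbol_probs) / len(original)
    return lam * distortion + rate_bpp
```

A larger `lam` penalizes distortion more heavily and so pushes a trained model toward higher quality at a higher bitrate.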
- Research Article
19
- 10.1609/aaai.v38i6.28317
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
Video compression relies heavily on exploiting the temporal redundancy between video frames, which is usually achieved by estimating and using the motion information. The motion information is represented as optical flows in most of the existing deep video compression networks. Indeed, these networks often adopt pre-trained optical flow estimation networks for motion estimation. The optical flows, however, may be less suitable for video compression due to the following two factors. First, the optical flow estimation networks were trained to perform inter-frame prediction as accurately as possible, but the optical flows themselves may cost too many bits to encode. Second, the optical flow estimation networks were trained on synthetic data, and may not generalize well enough to real-world videos. We address the twofold limitations by enhancing the optical flows in two stages: offline and online. In the offline stage, we fine-tune a trained optical flow estimation network with the motion information provided by a traditional (non-deep) video compression scheme, e.g. H.266/VVC, as we believe the motion information of H.266/VVC achieves a better rate-distortion trade-off. In the online stage, we further optimize the latent features of the optical flows with a gradient descent-based algorithm for the video to be compressed, so as to enhance the adaptivity of the optical flows. We conduct experiments on two state-of-the-art deep video compression schemes, DCVC and DCVC-DC. Experimental results demonstrate that the proposed offline and online enhancement together achieves on average 13.4% bitrate saving for DCVC and 4.1% bitrate saving for DCVC-DC on the tested videos, without increasing the model or computational complexity of the decoder side.
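The online stage, in which latent features are refined by gradient descent for the specific video while the decoder stays fixed, can be sketched with a toy scalar example. Everything here is a stand-in assumption: a single latent `z`, a linear "decoder" `decoder_weight * z`, and `z**2` as a proxy for the bit cost. The actual method operates on the latent tensors of the optical flows.

```python
def refine_latent(target, decoder_weight, lam, steps=200, lr=0.05):
    """Gradient descent on one scalar latent z with the decoder frozen.

    Loss (toy assumption): (target - decoder_weight * z)**2 + lam * z**2,
    where z**2 stands in for the bit cost of the latent.
    """
    z = 0.0
    for _ in range(steps):
        residual = target - decoder_weight * z
        # Derivative of the loss with respect to z.
        grad = -2.0 * decoder_weight * residual + 2.0 * lam * z
        z -= lr * grad
    return z
```

Only the latent is updated, so the decoder needs no extra parameters or computation, which is consistent with the abstract's claim about decoder-side complexity. The step size must be small enough for the iteration to converge.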
- Conference Article
- 10.1109/dcc52660.2022.00010
- Mar 1, 2022
Among recent deep image compression frameworks, transform coding combined with a context-adaptive entropy model is the most representative approach and achieves the best coding performance. In the entropy model, 2D masked convolution is widely used to capture the spatial context, but it omits the correlations along the channel dimension. To complement the spatial context, a cross channel context model is proposed. For the transform, we investigate how additional network layers, added to improve its representation ability, should be allocated between the forward and inverse transforms. After analyzing a deep image compression scheme connected with a loop filter, we find that this allocation can be regarded as a more generalized loop filter. The proposed cross channel context model and generalized loop filter (CCCMGLF) are integrated into the deep image compression framework and jointly optimized to improve the coding performance. Experimental results demonstrate that, using PSNR as the distortion metric, the proposed CCCMGLF outperforms VTM-11.0 by 1.20%, 10.82% and 5.38% in terms of BD-rate reductions for the Y, U and V components, respectively, on the Kodak dataset. On the JVET CTC sequences, the proposed method outperforms VTM-11.0 by 1.44% for Y but has a coding performance loss of 24.74% and 11.91% for U and V, respectively. Over the baseline deep compression framework, the proposed method provides 7.80%, 12.66% and 11.15% performance improvement for Y, U and V, respectively, on the Kodak dataset, and 9.10%, 12.27% and 12.68% performance improvement for Y, U and V, respectively, on the JVET CTC sequences. The proposed approaches are applicable to both image compression and intra coding in video compression.
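The 2D masked convolution used for spatial context can be illustrated by the mask it applies to its kernel. The sketch below assumes the usual raster-scan convention with an odd kernel size, in which the centre position itself is excluded. The function name is illustrative, and the paper's cross channel context model is not shown.

```python
def causal_mask(kernel_size):
    """Return a kernel_size x kernel_size 0/1 mask for a masked 2D convolution.

    Positions that precede the centre in raster-scan order are 1 (already
    decoded, usable as context); the centre and everything after it are 0.
    """
    center = kernel_size // 2
    mask = []
    for row in range(kernel_size):
        mask.append([
            1 if (row < center or (row == center and col < center)) else 0
            for col in range(kernel_size)
        ])
    return mask
```

Multiplying the convolution weights by this mask ensures that the probability model for a latent element depends only on elements the decoder has already reconstructed.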
- Research Article
7
- 10.1016/j.jvcir.2022.103737
- Dec 23, 2022
- Journal of Visual Communication and Image Representation
Optimized video compression with residual split attention and swin-block artifact contraction
- Conference Article
18
- 10.1109/icip42928.2021.9506269
- Sep 19, 2021
Rate adaption is one of the decisive factors for the applications of video compression. Previous deep video compression methods are usually optimized for a single fixed rate-distortion (R-D) tradeoff. While they can achieve multiple bitrates by training multiple independent models, the achievable bitrates are limited to several discrete points on the R-D curve and the storage cost increases proportionally to the number of models. We propose a variable-rate scheme for deep video compression, which can achieve continuously variable rate by a single model, i.e., reaching any point on the R-D curve. In our scheme, two deep auto-encoders are used to compress the residual and the motion vector field respectively, which directly generate the final bitstream. The basic rate adaptation can be achieved by using the R-D tradeoff parameter to deeply modulate all the internal feature maps of the auto-encoders. In addition, other modules in our scheme, notably motion estimation and motion compensation, also affect the final bitrate indirectly. We further use the R-D tradeoff parameter to modulate them via a conditional map, thereby effectively improving the compression efficiency. We use a multi-rate-distortion loss function together with a step-by-step training strategy to optimize the entire scheme. The experimental results show the proposed scheme achieves continuously variable rate by a single model with almost the same compression efficiency as multiple fixed-rate models. The additional parameters and computation of our model are negligible when compared with a single fixed-rate model.
- Conference Article
222
- 10.1109/cvpr42600.2020.00666
- Jun 1, 2020
In this paper, we propose a Hierarchical Learned Video Compression (HLVC) method with three hierarchical quality layers and a recurrent enhancement network. The frames in the first layer are compressed by an image compression method with the highest quality. Using these frames as references, we propose the Bi-Directional Deep Compression (BDDC) network to compress the second layer with relatively high quality. Then, the third layer frames are compressed with the lowest quality, by the proposed Single Motion Deep Compression (SMDC) network, which adopts a single motion map to estimate the motions of multiple frames, thus saving bits for motion information. In our deep decoder, we develop the Weighted Recurrent Quality Enhancement (WRQE) network, which takes both compressed frames and the bit stream as inputs. In the recurrent cell of WRQE, the memory and update signal are weighted by quality features to reasonably leverage multi-frame information for enhancement. In our HLVC approach, the hierarchical quality benefits the coding efficiency, since the high quality information facilitates the compression and enhancement of low quality frames at encoder and decoder sides, respectively. Finally, the experiments validate that our HLVC approach advances the state-of-the-art of deep video compression methods, and outperforms the "Low-Delay P (LDP) very fast" mode of x265 in terms of both PSNR and MS-SSIM. The project page is at https://github.com/RenYang-home/HLVC.
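The three-layer quality hierarchy can be pictured as a mapping from frame index to layer. The sketch below is an assumption for illustration only: the abstract does not state the group-of-pictures (GOP) length or which frames belong to which layer, so a GOP of 10 with the middle frame in the second layer is hypothetical.

```python
def quality_layer(frame_index, gop_size=10):
    """Map a frame index to quality layer 1 (highest), 2, or 3 (lowest)."""
    position = frame_index % gop_size
    if position == 0:
        return 1          # GOP boundary frame, coded as a high-quality image
    if position == gop_size // 2:
        return 2          # middle frame, predicted from both layer-1 neighbours
    return 3              # remaining frames, lowest quality
```

Under this assumed layout, every low-quality frame has a higher-quality frame nearby, which is the property the abstract credits for helping both compression and enhancement.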
- Research Article
24
- 10.1109/tip.2023.3251020
- Jan 1, 2023
- IEEE Transactions on Image Processing
In this work, we propose a new deep image compression framework called Complexity and Bitrate Adaptive Network (CBANet) that aims to learn one single network to support variable bitrate coding under various computational complexity levels. In contrast to the existing state-of-the-art learning-based image compression frameworks that only consider the rate-distortion trade-off without introducing any constraint related to the computational complexity, our CBANet considers the complex rate-distortion-complexity trade-off when learning a single network to support multiple computational complexity levels and variable bitrates. Since it is a non-trivial task to solve such a rate-distortion-complexity related optimization problem, we propose a two-step approach to decouple this complex optimization task into a complexity-distortion optimization sub-task and a rate-distortion optimization sub-task, and additionally propose a new network design strategy by introducing a Complexity Adaptive Module (CAM) and a Bitrate Adaptive Module (BAM) to respectively achieve the complexity-distortion and rate-distortion trade-offs. As a general approach, our network design strategy can be readily incorporated into different deep image compression methods to achieve complexity and bitrate adaptive image compression by using a single network. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of our CBANet for deep image compression. Code is released at https://github.com/JinyangGuo/CBANet-release.
- Conference Article
- 10.1109/icassp.1995.479908
- May 9, 1995
This paper introduces a new framework for video compression. The proposed method considers noise directly in the video sequence and seeks the optimal compression ratio and video quality. Compression is achieved by eliminating the spatial and temporal redundancies found in the intensity and motion fields of the video. Processing is performed in blocks of N frames stored in a video buffer. Encoder and decoder are synchronized prior to the transmission of a new block. A reference frame is chosen from each block and encoded before transmission. Spatial redundancies in the intensity domain are reduced by a wavelet filter. The pixel-motion field between the reference frame and other frames in a block is evaluated using a Kalman filter that estimates the pixel motion in the presence of noise. Video frames are predicted from the reference frame and the corresponding motion field. Prediction errors, motion vectors and the reference frame are compressed in wavelet domain before transmission. The compression system includes quantization and entropy coding.
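The prediction step described above (frames predicted from a reference frame and a pixel-motion field, with the prediction error transmitted) can be sketched on a 1-D signal. This is a simplified illustration with integer displacements and border clamping; the Kalman-filter motion estimation, wavelet transform, quantization, and entropy coding are deliberately left out, and all names are hypothetical.

```python
def predict(reference, motion):
    """Predict a frame: pixel i is copied from reference[i + motion[i]] (clamped)."""
    last = len(reference) - 1
    return [reference[min(max(i + d, 0), last)] for i, d in enumerate(motion)]

def encode(current, reference, motion):
    """Return the prediction error that would be sent alongside the motion field."""
    return [c - p for c, p in zip(current, predict(reference, motion))]

def decode(residual, reference, motion):
    """Rebuild the frame from the reference, the motion field, and the residual."""
    return [p + r for p, r in zip(predict(reference, motion), residual)]
```

When the motion field explains most of the change between frames, the residual is mostly zeros and therefore cheap to compress.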
- Conference Article
6
- 10.1109/icpr48806.2021.9412821
- Jan 10, 2021
In recent years, optical flow has often been estimated to reuse features and thereby accelerate video semantic segmentation. Adding an optical flow network, however, may incur extra cost, and accuracy may be degraded by the repeated warping operations. In this paper, we propose a boundary-aware distillation network (BDNet) that replaces the optical flow network with the block motion vectors encoded in compressed video, resulting in negligible computational complexity. To produce salient features, an auxiliary boundary-aware stream is added to the main stream to jointly estimate the silhouette and segmentation of objects. To further correct warped features, a well-trained teacher network is employed to transfer knowledge to the main stream. Both the boundary-aware stream and the teacher network are dropped at the inference stage, so the video segmentation network becomes faster without any added computational burden. By splitting the task into three components, our BDNet shows almost 10% time saving as well as 1.6% accuracy improvement over the baseline on the Cityscapes dataset.
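Feature reuse with block motion vectors can be sketched as a warp in which every pixel of a block shares one displacement read from the compressed stream. The block size, the (dy, dx) convention, and the border clamping below are assumptions for illustration, not details taken from the paper.

```python
def warp_with_block_mvs(features, block_mvs, block_size):
    """Warp a 2D feature map using one (dy, dx) motion vector per block."""
    height, width = len(features), len(features[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = block_mvs[y // block_size][x // block_size]
            src_y = min(max(y + dy, 0), height - 1)   # clamp at the borders
            src_x = min(max(x + dx, 0), width - 1)
            row.append(features[src_y][src_x])
        warped.append(row)
    return warped
```

Because the motion vectors are already present in the bitstream, this warp needs no flow network at inference time, only index arithmetic.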
- Video Transcripts
- 10.48448/vn4m-s957
- Dec 29, 2020
- Underline Science Inc.
In recent years, optical flow has often been estimated to reuse features and thereby accelerate video semantic segmentation. Adding an optical flow network, however, may incur extra cost, and accuracy may be degraded by the repeated warping operations. In this paper, we propose a boundary-aware distillation network (BDNet) that replaces the optical flow network with the block motion vectors encoded in compressed video, resulting in negligible computational complexity. To produce salient features, an auxiliary boundary-aware stream is added to the main stream to jointly estimate the silhouette and segmentation of objects. To further correct warped features, a well-trained teacher network is employed to transfer knowledge to the main stream. Both the boundary-aware stream and the teacher network are dropped at the inference stage, so the video segmentation network becomes faster without any added computational burden. By splitting the task into three components, our BDNet shows almost 10% time saving as well as 1.6% accuracy improvement over the baseline on the Cityscapes dataset.
- Research Article
13
- 10.1109/access.2020.3046040
- Dec 21, 2020
- IEEE Access
Recently, deep learning-based image compression has shown significant performance improvement in terms of coding efficiency and subjective quality. However, there has been relatively less effort on video compression based on deep neural networks. In this paper, we propose an end-to-end deep predictive video compression network, called DeepPVCnet, using mode-selective uni- and bi-directional predictions based on a multi-frame hypothesis with a multi-scale structure and a temporal-context-adaptive entropy model. Our DeepPVCnet jointly compresses motion information and residual data that are generated from the multi-scale structure via the feature transformation layers. Recent deep learning-based video compression methods were proposed in a limited compression environment using only P-frames or only B-frames. Drawing on lessons from conventional video codecs, we first incorporate a mode-selective framework into our DeepPVCnet with uni- and bi-directional predictive modes in a rate-distortion minimization sense. We also propose a temporal-context-adaptive entropy model that utilizes the temporal context information of the reference frames for coding the current frame. The autoregressive entropy models for CNN-based image and video compression are difficult to compute with parallel processing. In contrast, our temporal-context-adaptive entropy model utilizes temporally coherent context from the reference frames, so that the context information can be computed in parallel, which is computationally and architecturally advantageous. Extensive experiments show that our DeepPVCnet outperforms AVC/H.264, HEVC/H.265 and state-of-the-art methods in terms of MS-SSIM.
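Selecting a prediction mode "in a rate-distortion minimization sense" can be sketched as the classical mode decision used in conventional codecs: compute a cost D + λ·R for each candidate and keep the cheapest. The mode names and the numbers in the usage note are hypothetical, and the paper's network-based selection is not reproduced here.

```python
def select_mode(candidates, lam):
    """Pick the prediction mode with the lowest cost D + lam * R.

    candidates maps a mode name to a (distortion, rate_in_bits) pair.
    """
    return min(candidates, key=lambda mode: candidates[mode][0] + lam * candidates[mode][1])
```

For example, with `{"uni": (4.0, 10.0), "bi": (2.0, 30.0)}` a small λ favours the more accurate but more expensive bi-directional mode, while a larger λ favours the cheaper uni-directional mode.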
- Conference Article
4
- 10.1109/dcc50243.2021.00058
- Mar 1, 2021
Rate adaption is one of the decisive factors for the applications of video compression. However, previous deep video compression methods are usually optimized for a single fixed rate-distortion (R-D) tradeoff. While they can achieve multiple bitrates by training multiple independent models, the realized bitrates are limited to several discrete points on the R-D curve and the storage cost increases proportionally to the number of models. In this paper, we propose a variable-rate scheme for deep video compression, which can achieve continuously variable rate by a single model, i.e., it can reach any point on the R-D curve. In our scheme, two deep auto-encoders are used to compress the residual and the motion vector field respectively, which directly generate the final bitstream. The basic rate adaptation can be achieved by using the R-D tradeoff parameter to deeply modulate all the internal feature maps of the auto-encoders. However, other modules in our scheme, notably motion estimation and motion compensation, also affect the final bitrate indirectly. We further use the R-D tradeoff parameter to modulate them via a conditional map, which effectively improves the compression efficiency. We use a multi-rate-distortion loss function together with a step-by-step training strategy to optimize the entire scheme. Our experiments show that the proposed scheme achieves continuously variable rate by a single model with almost the same compression efficiency as multiple fixed-rate models. The additional parameters and computation of our model are negligible when compared with a single fixed-rate model.
- Research Article
121
- 10.1109/lsp.2020.2970539
- Dec 11, 2019
- IEEE Signal Processing Letters
Variable rate is a requirement for flexible and adaptable image and video compression. However, deep image compression (DIC) methods are optimized for a single fixed rate-distortion (R-D) tradeoff. While this can be addressed by training multiple models for different tradeoffs, the memory requirements increase proportionally to the number of models. Scaling the bottleneck representation of a shared autoencoder can provide variable rate compression with a single model. However, this simple mechanism degrades R-D performance at low bitrates and also shrinks the effective range of bitrates. To address these limitations, we formulate the problem of variable R-D optimization for DIC, and propose modulated autoencoders (MAEs), where the representations of a shared autoencoder are adapted to the specific R-D tradeoff via a modulation network. Jointly training this modulated autoencoder and the modulation network provides an effective way to navigate the R-D operational curve. Our experiments show that the proposed method can achieve almost the same R-D performance as independent models with significantly fewer parameters.
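The baseline mechanism mentioned above, scaling the bottleneck of a shared autoencoder, can be sketched as follows. The latent values and scale factors are made up for illustration; the modulation network that the paper proposes in place of this baseline is not shown.

```python
def quantize(latents, scale):
    """Scale the bottleneck, then round to integer symbols for entropy coding."""
    return [round(value * scale) for value in latents]

def dequantize(symbols, scale):
    """Undo the scaling on the decoder side."""
    return [symbol / scale for symbol in symbols]

def mse(a, b):
    """Mean squared error between two equally sized lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

A larger scale spreads the latents over more integer levels, so more bits are spent and the reconstruction error drops. A smaller scale does the opposite, which is how a single model reaches different points on the R-D curve.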
- Conference Article
4
- 10.1109/vcip56404.2022.10008883
- Dec 13, 2022
Recently, deep learning-based video compression algorithms have achieved competitive performance in Bjøntegaard delta (BD) rate, especially those adopting super-resolution networks as post-processing modules in downsampling-based video compression (DBC) frameworks. However, limited by the non-differentiable nature of traditional codecs, DBC frameworks mainly focus on improving the super-resolution modules and neglect the optimization of the downscaling modules. In practical application scenarios, it is crucial to improve video compression performance without modifying the decoder on the client side. We propose a context-aware processing network (CPN) that is compatible with standard codecs and adds no computational burden to the client, while preserving the critical information and essential structures during downscaling. The proposed CPN works as a precoder cascaded with standard codecs to improve the compression performance on the server before encoding and transmission. In addition, a surrogate codec is employed to simulate the degradation process of the standard codecs and backpropagate the gradient to optimize the CPN. Experimental results show that the proposed method outperforms the latest pre-processing networks and performs competitively with the latest DBC frameworks.
- Research Article
- 10.1109/tcsvt.2025.3631516
- Jan 1, 2025
- IEEE Transactions on Circuits and Systems for Video Technology
Unlike natural videos, screen content videos (SCVs) often exhibit homogeneous regions, abrupt content changes, and a high prevalence of repetitive patterns. Existing deep learning (DL)-based video compression methods inadequately address these unique characteristics of SCVs, resulting in suboptimal compression performance. Therefore, in this paper, a dedicated deep screen content video compression (DSCVC) framework is proposed based on the motion and content characteristics of SCVs, which includes a superpixel-constrained motion estimation (SCME) module and an inter and intra context aggregation (I2CA) module. The SCME is designed to construct a superpixel-based representation of homogeneous regions, leveraging the global correlations among superpixels to effectively capture large-scale motions, which improves the compression performance. I2CA is developed to jointly utilize inter and intra contexts; it employs a gating mechanism for content-aware context fusion, dynamically aggregating more similar contexts within SCVs. This allows for flexible adaptation to both contiguous and abrupt content changes within SCVs. Furthermore, by leveraging both learnable window and pixel displacements, a displacement-guided window attention mechanism is implemented in I2CA for precise long-range repetitive feature localization, thereby reducing redundancy caused by repetitive patterns. To the best of our knowledge, this is the first DL-based video compression framework specifically designed for SCVs. Extensive experimental results demonstrate that the proposed DSCVC significantly outperforms existing methods in terms of compression performance, achieving a bitrate saving of 26.82% compared to VVC and a bitrate saving of 12.30% compared to SOTA DL-based methods.
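A gating mechanism for fusing inter and intra contexts can be sketched as an element-wise convex blend controlled by a sigmoid gate. This form is an assumption chosen for illustration: the abstract does not specify how the gate is computed or applied, and in a real model the gate logits would come from a learned layer rather than being passed in.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(inter_context, intra_context, gate_logits):
    """Blend two context vectors element-wise with a sigmoid gate in [0, 1]."""
    fused = []
    for inter, intra, logit in zip(inter_context, intra_context, gate_logits):
        gate = sigmoid(logit)
        fused.append(gate * inter + (1.0 - gate) * intra)
    return fused
```

A gate near 1 relies on the inter (temporal) context, which suits contiguous content, while a gate near 0 falls back to the intra context, which suits abrupt scene changes.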
- Research Article
7
- 10.1145/3715144
- Mar 10, 2025
- ACM Transactions on Multimedia Computing, Communications, and Applications
Recently, many works have applied deep learning techniques to video compression tasks, achieving promising results and advancing the field of Deep Learning-Based Video Compression (DLVC). However, the architecture design of the existing DLVC is rigid and limited in terms of flexibility. Specifically, different networks must be designed for different scenarios, such as delay-constrained scenario or non-delay-constrained scenario. Frequent switching between networks would reduce the speed of modern deep learning platforms and increase the maintenance costs. To address this problem, we propose a Unified Video Compression (UVC) framework that can be freely switched to different application scenarios without changing the network architecture. Our proposed UVC framework is based on the explicit-compression and implicit-generation perspective, which contains two sub-networks—the Explicit Reference Frame Compression Network (ERFCN) and the Implicit Reference Frame Generation Network (IRFGN). The aim of ERFCN is to compress the current frame with the help of the reference frame. To improve the performance of ERFCN, we first introduce the Transformer in this network, which can fully remove the spatial redundancy of the input image and is beneficial for the following inter-prediction process. We also develop a novel long-range motion estimation module for inter-prediction to generate motion vectors based on global motion information between two frames, which can handle long-range complex motion relations. The aim of IRFGN is to capture the temporal relationship between forward and backward reconstructed frames and synthesize a high-quality implicit reference frame for the current frame. To achieve this, we design the split spatial-temporal attention and multi-scale prediction module. 
We conduct extensive experiments on three widely used video compression datasets (HEVC, UVG, and MCL-JCV), and the results demonstrate the superiority of our approach over other related DLVC methods.