Diffusion-based Perceptual Neural Video Compression with Temporal Diffusion Information Reuse
Recently, foundational diffusion models have attracted considerable attention in image compression tasks, whereas their application to video compression remains largely unexplored. In this article, we introduce DiffVC, a diffusion-based perceptual neural video compression framework that effectively integrates a foundational diffusion model with the video conditional coding paradigm. This framework uses temporal context from the previously decoded frame and the reconstructed latent representation of the current frame to guide the diffusion model in generating high-quality results. To accelerate the iterative inference process of the diffusion model, we propose the Temporal Diffusion Information Reuse (TDIR) strategy, which significantly enhances inference efficiency with minimal performance loss by reusing the diffusion information from previous frames. Additionally, to address the challenges posed by distortion differences across various bitrates, we propose the Quantization Parameter-based Prompting (QPP) mechanism, which utilizes quantization parameters as prompts fed into the foundational diffusion model to explicitly modulate intermediate features, thereby enabling a robust variable bitrate diffusion-based neural compression framework. Experimental results demonstrate that our proposed solution delivers excellent performance in both perception metrics and visual quality.
- Conference Article
1
- 10.1109/vcip53242.2021.9675405
- Dec 5, 2021
Neural compression has benefited from technological advances such as convolutional neural networks (CNNs) to achieve advanced bitrates, especially in image compression. In neural image compression, an encoder and a decoder can run in parallel on a GPU, so the speed is relatively fast. However, the conventional entropy coding for neural image compression requires serialized iterations in which the probability distribution is estimated by multi-layer CNNs and entropy coding is processed on a CPU. Therefore, the total compression and decompression speed is slow. We propose a fast, practical, GPU-intensive entropy coding framework that consistently executes entropy coding on a GPU through highly parallelized tensor operations, as well as an encoder, decoder, and entropy estimator with an improved network architecture. We experimentally evaluated the speed and rate-distortion performance of the proposed framework and found that we could significantly increase the speed while maintaining the bitrate advantage of neural image compression.
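As a rough illustration of why the estimated probability distribution drives the bitrate in entropy coding, the sketch below computes the ideal code length, i.e. the sum of -log2 p(s) over the coded symbols, which an arithmetic or range coder approaches. The alphabet, the probabilities, and the function name are illustrative assumptions; the sketch does not reproduce the GPU-based coder described in the abstract.

```python
import math


def ideal_code_length_bits(symbols, probs):
    """Return the sum of -log2 p(s): the rate an ideal entropy coder approaches."""
    return sum(-math.log2(probs[s]) for s in symbols)


# Hypothetical 4-symbol alphabet with a skewed distribution (assumed values).
probs = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
symbols = [0, 0, 1, 0, 2, 3, 0, 1]

total_bits = ideal_code_length_bits(symbols, probs)
fixed_length_bits = 2 * len(symbols)  # 2 bits per symbol without entropy coding

print(f"ideal: {total_bits:.1f} bits, fixed-length: {fixed_length_bits} bits")
```

With this skewed distribution the ideal length falls below the fixed-length cost, which is the saving a better probability model buys.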
- Preprint Article
- 10.5194/egusphere-egu24-19460
- Mar 11, 2024
Earth observation (EO) repositories comprise Petabytes of data. Due to their widespread use, these repositories experience extremely large volumes of data transfers. For example, users of the Sentinel Data Access System downloaded 78.6 PiB of data in 2022 alone. The transfer of such data volumes between data producers and consumers causes substantial latency and requires significant amounts of energy and vast storage capacities. This work introduces Neural Embedding Compression (NEC), a method that transmits compressed embeddings to users instead of raw data, greatly reducing transfer and storage costs. The approach uses general purpose embeddings from Foundation Models (FM), which can serve multiple downstream tasks and neural compression, which balances between compression rate and the utility of the embeddings. We implemented the method by updating a minor portion of the FM’s parameters (approximately 10%) for a short training period of about 1% of the original pre-training iterations. NEC’s effectiveness is assessed through two EO tasks: scene classification and semantic segmentation. When compared to traditional compression methods applied to raw data, NEC maintains similar accuracy levels while reducing data by 75% to 90%. Notably, even with a compression rate of 99.7%, there’s only a 5% decrease in accuracy for scene classification. In summary, NEC offers a resource-efficient yet effective solution for multi-task EO modeling with minimal transfer of data volumes.
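To put the reported percentages in perspective, the short sketch below converts a fractional size reduction into an original-to-compressed ratio. It assumes that the "compression rate of 99.7%" denotes a 99.7% reduction in data size; that reading, and the function name, are assumptions rather than statements from the abstract.

```python
def compression_ratio(reduction_fraction):
    """Convert a fractional size reduction (e.g. 0.75) into an original:compressed ratio."""
    if not 0.0 <= reduction_fraction < 1.0:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return 1.0 / (1.0 - reduction_fraction)


# Reductions quoted in the abstract, read as fractions of the original size removed.
for reduction in (0.75, 0.90, 0.997):
    ratio = compression_ratio(reduction)
    print(f"{reduction:.1%} reduction -> about {ratio:.0f}x smaller")
```

Under that reading, 75% to 90% corresponds to roughly 4x to 10x smaller data, and 99.7% to roughly 333x.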
- Research Article
1
- 10.3390/sym17060913
- Jun 10, 2025
- Symmetry
Deep neural video compression codecs have shown great promise in recent years. However, there are still considerable challenges for ultra-low bitrate video coding. Inspired by recent attempts to apply diffusion models to image and video compression, we leverage diffusion models for ultra-low bitrate portrait video compression. In this paper, we propose a predictive portrait video compression method that leverages the temporal prediction capabilities of diffusion models. Specifically, we develop a temporal diffusion predictor based on a conditional latent diffusion model, with the predicted results serving as decoded frames. We symmetrically integrate a temporal diffusion predictor at both the encoding and decoding sides. When the perceptual quality of the predicted results at the encoding end falls below a predefined threshold, a new frame sequence is employed for prediction, while the predictor at the decoding side directly generates predicted frames as the reconstruction based on the evaluation results. This symmetry ensures that the prediction frames generated at the decoding end are consistent with those at the encoding end. We also design an adaptive coding strategy that incorporates frame quality assessment and adaptive keyframe control. To ensure consistent quality of subsequent predicted frames and achieve high perceptual reconstruction, this strategy dynamically evaluates the visual quality of the predicted results during encoding, retains the predicted frames that meet the quality threshold, and adaptively adjusts the length of the keyframe sequence based on motion complexity. The experimental results demonstrate that, compared with the traditional video codecs and other popular methods, the proposed scheme provides superior compression performance at ultra-low bitrates while maintaining competitiveness in visual effects, achieving more than 24% bitrate savings compared with VVC in terms of perceptual distortion.
- Conference Article
14
- 10.1109/dcc52660.2022.00082
- Mar 1, 2022
Recent advances in deep learning have led to superhuman performance across a variety of applications. Recently, these methods have been successfully employed to improve the rate-distortion performance in the task of image compression. However, current methods either use additional post-processing blocks on the decoder end to improve compression or propose an end-to-end compression scheme based on heuristics. For the majority of these, the trained deep neural networks (DNNs) are not compatible with standard encoders and would be difficult to deploy on personal computers and cellphones. In light of this, we propose a system that learns to improve the encoding performance by enhancing its internal neural representations on both the encoder and decoder ends, an approach we call Neural JPEG. We propose frequency domain pre-editing and post-editing methods to optimize the distribution of the DCT coefficients at both encoder and decoder ends in order to improve the standard compression (JPEG) method. Moreover, we design and integrate a scheme for jointly learning quantization tables within this hybrid neural compression framework. In summary, our contributions are as follows:
- Research Article
- 10.1016/j.eswa.2024.125535
- Oct 11, 2024
- Expert Systems With Applications
Enhanced neural video compression for cloud gaming videos with aligned frame generation
- Book Chapter
- 10.1007/978-981-19-5096-4_3
- Jan 1, 2022
A novel trend in video compression is to use end-to-end optimized neural techniques. However, the rate-distortion (R-D) behavior of such schemes remains unexplored. In this paper, we study for the first time the essential characteristics of neural video compression (NVC) by comparatively modeling the R-D behavior of conventional codecs and NVC. We observe that the proportions of coding bits required for the motion field and the residual are essentially different between the two kinds of codecs. We also show that improving the efficiency of the inter prediction module would be the key factor in narrowing the performance gap between NVC and conventional codecs. Given this observation, we propose the rate-distortion modeling inspired neural video compression (RD-NVC) framework to increase prediction accuracy and reduce residual coding bits. For the former, a novel prediction refinement network is proposed to improve predictive coding efficiency. For the latter, coarse-to-fine (C2F) residual modeling and in-loop restoration are proposed to save residual coding bits. The proposed framework substantially improves the R-D performance of NVC in a comprehensive manner. The experiment demonstrates that our method outperforms the state-of-the-art single reference frame NVC approaches. To the best of our knowledge, the proposed method is the first NVC that shows comparable R-D performance with H.266/VVC in terms of MS-SSIM under the same prediction structure. Keywords: Neural network, Video coding, End-to-end optimization, Rate-distortion analysis
- Book Chapter
34
- 10.1007/978-3-031-19809-0_32
- Jan 1, 2022
We present the first neural video compression method based on generative adversarial networks (GANs). Our approach significantly outperforms previous neural and non-neural video compression methods in a user study, setting a new state-of-the-art in visual quality for neural methods. We show that the GAN loss is crucial to obtain this high visual quality. Two components make the GAN loss effective: we i) synthesize detail by conditioning the generator on a latent extracted from the warped previous reconstruction to then ii) propagate this detail with high-quality flow. We find that user studies are required to compare methods, i.e., none of our quantitative metrics were able to predict all studies. We present the network design choices in detail, and ablate them with user studies. Keywords: Neural Video Compression, GANs
- Research Article
24
- 10.1109/tvcg.2024.3372096
- May 1, 2024
- IEEE Transactions on Visualization and Computer Graphics
Point cloud video (PCV) offers watching experiences in photorealistic 3D scenes with six-degree-of-freedom (6-DoF), enabling a variety of VR and AR applications. The user's Field of View (FoV) is more fickle with 6-DoF movement than 3-DoF movement in 360-degree video. PCV streaming is extremely bandwidth-intensive. However, current streaming systems require hundreds of Mbps bandwidth, exceeding the bandwidth capabilities of commodity devices. To save bandwidth, FoV-adaptive streaming predicts a user's FoV and only downloads point cloud data falling in the predicted FoV. But it is difficult to accurately predict the user's FoV even 2-3 seconds before playback due to 6-DoF. Misprediction of FoV or network bandwidth dips results in frequent stalls. To avoid rebuffering, existing systems would cause incomplete FoV and degraded experience, deteriorating the user's quality of experience (QoE). In this paper, we describe Fumos, a novel system that preserves interactive experience by avoiding playback stalls while maintaining high perceptual quality and high compression rate. We identify a research gap in inter-frame redundancy utilization and progressive mechanisms. Fumos has three crucial designs: (1) a neural compression framework with inter-frame coding, namely N-PCC, which achieves both bandwidth efficiency and high fidelity; (2) a progressive refinement streaming framework that enables continuous playback by incrementally upgrading a fetched portion to a higher quality; and (3) system-level adaptation that employs Lyapunov optimization to jointly optimize the long-term user QoE. Experimental results demonstrate that Fumos significantly outperforms Draco, achieving an average decoding rate acceleration of over 260×. Moreover, the proposed compression framework N-PCC attains remarkable BD-Rate gains, averaging 91.7% and 51.7% against the state-of-the-art point cloud compression methods G-PCC and V-PCC, respectively.
- Research Article
2
- 10.1016/j.neucom.2024.128525
- Sep 5, 2024
- Neurocomputing
Deep learning is being increasingly applied to image and video compression in a new paradigm known as neural video compression. While achieving impressive rate–distortion (RD) performance, neural video codecs (NVC) require heavy neural networks, which in turn have large memory and computational costs and often lack important functionalities such as variable rate. These are significant limitations to their practical application. Addressing these problems, recent slimmable image codecs can dynamically adjust their model capacity to elegantly reduce the memory and computation requirements, without harming RD performance. However, the extension to video is not straightforward due to the non-trivial interplay with complex motion estimation and compensation modules in most NVC architectures. In this paper we propose the slimmable video codec framework (SlimVC) that integrates a slimmable autoencoder and a motion-free conditional entropy model. We show that the slimming mechanism is also applicable to the more complex case of video architectures, providing SlimVC with simultaneous control of the computational cost, memory and rate, which are all important requirements in practice. We further provide detailed experimental analysis, and describe application scenarios that can benefit from slimmable video codecs.
- Conference Article
3
- 10.1109/csndsp.2008.4610736
- Jul 1, 2008
In video transmission over low-bandwidth channels, high-quality video and sufficient channel throughput should be guaranteed. The last two decades have witnessed an unprecedented growth of wireless communication technologies. It turned out that competition for bandwidth resources is fierce, and onboard power and weight constraints in autonomous vehicles limit the maximum data transmission rates. These factors highlight a critical need for very effective data compression schemes. Images tend to be the most bandwidth-intensive data; therefore, image and video compression methods are particularly valuable. In this paper, the authors present a novel technique for optimizing video compression while visual quality is not compromised. This opens up new scopes in the domain of visual quality assessment of images and videos returned from autonomous vehicles.
- Dissertation
- 10.5463/thesis.776
- Sep 11, 2024
Deep learning has become a major field with many applications: from face recognition to generating images to compressing data. As a result, deep learning is becoming more and more integrated into our daily lives. We demonstrate how deep learning can be deployed for several applications in three different domains, namely improving business processes for agriculture, high-dimensional density estimation with generative models, and neural compression of data. The first domain aims to optimize the business process of a seed breeding company operating in agriculture. To this end, we examine a dataset of white cabbage seedling images. The aim is to predict the (un)successfulness of the seedlings based on only an image. Accurate and early predictions can cut short a seedling's stay in a growth chamber, which frees space for other seeds to grow. Further, automating the process aids professionals. We show how a particular convolutional neural network, AlexNet, outperforms the other machine learning methods and that the model can accurately determine if a seedling is going to grow (un)successfully. Moreover, we observe that training AlexNet on earlier days generalizes to predictions on later days. The second domain concerns the utilization of generative modeling for high-dimensional density estimation, since this is an open problem in deep learning. We aim to close the gap in estimating the true data distribution that is modeled with generative models. More concretely, we improve model performance of a generative model, known as the normalizing flow. To this end, we construct new methods and propose an activation function, which we call Concatenated LipSwish. The new architecture is known as i-DenseNet and outperforms its predecessor Residual Flow and other comparable flow-based models on generative and hybrid modeling performance. Finally, the third domain covers the neural compression process for images and videos.
With the growing amount of data worldwide, compression, in general, has become a fundamental part of data storage and transmission. We first examine a neural image compression model, known as the mean-scale hyperprior. Even though these models are effective in practice, they do have limited capacity when it comes to optimization and generalization. Therefore, we introduce three new refinement methods that aid compression performance and result in improved compression results per image. Additionally, we aim to optimize the latents of an already pre-trained image compression model by keeping the network's weights fixed and only further optimizing its latents with the refinement procedures. We show how the method can be extended to three-class rounding, outperforms the baselines, can be used to move partly along the rate-distortion curve, and is robust to hyperparameter changes. Finally, we introduce a neural video compression model, based on scale-space flow, that allocates more bits to pre-specified regions-of-interest. We introduce two versions that are able to achieve this, namely, an implicit and a latent scaling model. In general, both models outperform all baselines in terms of the rate-distortion performance in regions of interest and can generalize to different datasets at inference time. The latent scaling model has the best performance and can explicitly control the quantization binwidth of latent variables by only using a single model during evaluation. Further, we find that the models show a negligible performance gap when trained with synthetic region-of-interest masks, which do not correlate with the content of the video, compared to training with pixel-wise annotated masks.
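The effect of controlling the quantization bin width can be sketched with plain uniform quantization: a narrower bin inside a region of interest keeps latents closer to their original values, at the cost of more distinct symbols to code. The function names, bin widths, and mask below are hypothetical, and the sketch is only an illustration of the bin-width idea; it is not the learned latent scaling model from the thesis.

```python
def quantize(value, bin_width):
    """Uniformly quantize one latent value to the nearest multiple of bin_width."""
    return round(value / bin_width) * bin_width


def quantize_with_roi(latents, roi_mask, roi_bin=0.5, background_bin=2.0):
    """Use a finer bin inside the region of interest and a coarser one outside.

    roi_bin and background_bin are assumed values chosen for illustration.
    """
    return [
        quantize(v, roi_bin if in_roi else background_bin)
        for v, in_roi in zip(latents, roi_mask)
    ]


latents = [0.3, 1.2, -0.7, 2.6]
roi_mask = [True, True, False, False]  # first two latents lie in the ROI

quantized = quantize_with_roi(latents, roi_mask)
errors = [abs(q - v) for q, v in zip(quantized, latents)]
print(quantized, errors)
```

The quantization error is bounded by half the bin width, so the latents inside the mask are reconstructed more precisely than those outside it.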
- Research Article
3
- 10.1109/mgrs.2025.3546527
- Sep 1, 2025
- IEEE Geoscience and Remote Sensing Magazine
Over the past decades, there has been an explosion in the amount of available Earth Observation (EO) data. The unprecedented coverage of the Earth’s surface and atmosphere by satellite imagery has resulted in large volumes of data that must be transmitted to ground stations, stored in data centers, and distributed to end users. Modern Earth System Models (ESMs) face similar challenges, operating at high spatial and temporal resolutions, producing petabytes of data per simulated day. Data compression has gained relevance over the past decade, with neural compression (NC) emerging from deep learning and information theory, making EO data and ESM outputs ideal candidates due to their abundance of unlabeled data. In this review, we outline recent developments in NC applied to geospatial data. We introduce the fundamental concepts of NC including seminal works in its traditional applications to image and video compression domains with focus on lossy compression. We discuss the unique characteristics of EO and ESM data, contrasting them with “natural images”, and explain the additional challenges and opportunities they present. Additionally, we review current applications of NC across various EO modalities and explore the limited efforts in ESM compression to date. The advent of self-supervised learning (SSL) and foundation models (FM) has advanced methods to efficiently distill representations from vast unlabeled data. We connect these developments to NC for EO, highlighting the similarities between the two fields and elaborate on the potential of transferring compressed feature representations for machine–to–machine communication. Based on insights drawn from this review, we devise future directions relevant to applications in EO and ESM.
- Book Chapter
34
- 10.1007/978-3-030-80129-8_17
- Jan 1, 2021
We present a new PyTorch-based framework for neural network compression with fine-tuning named Neural Network Compression Framework (NNCF) (https://github.com/openvinotoolkit/nncf). It leverages recent advances of various network compression methods and implements some of them, namely quantization, sparsity, filter pruning and binarization. These methods allow producing more hardware-friendly models that can be efficiently run on general-purpose hardware computation units (CPU, GPU) or specialized deep learning accelerators. We show that the implemented methods and their combinations can be successfully applied to a wide range of architectures and tasks to accelerate inference while preserving the original model’s accuracy. The framework can be used in conjunction with the supplied training samples or as a standalone package that can be seamlessly integrated into the existing training code with minimal adaptations.
- Research Article
- 10.54216/fpa.170219
- Jan 1, 2025
- Fusion: Practice and Applications
Solving the video compression problem requires a multi-faceted approach, balancing quality, efficiency, and computational demands. By leveraging advancements in technology and adapting to the evolving needs of video applications, it is possible to develop compression methods that meet the challenges of the present and future digital landscape. To address these objectives, machine learning and AI approaches can be utilized to predict and remove redundancies more effectively, optimizing compression algorithms dynamically based on content. Still, state-of-the-art neural network-based video compression models need large and diverse datasets to generalize well across different types of video content. Wavelets can provide both time (spatial) and frequency localization, making them highly effective for video compression. This dual localization allows wavelet transforms to handle both rapid changes in video content and slow-moving scenes efficiently, leading to better compression ratios. Yet, some wavelet coefficients may be more critical for maintaining visual quality than others. Inaccurate quantization can lead to noticeable degradation. For the first time, the suggested model combines Quantum Wavelet Transform (QWT) and Neural Networks (NN) for video compression. This fusion model aims to achieve higher compression ratios, maintain video quality, and reduce computational complexity by utilizing QWT’s efficient data representation and NN’s powerful pattern recognition and predictive capabilities. Quantum bits (qubits) can encode large amounts of information in their quantum states, enabling more efficient data representation. This is especially useful for encoding large video files. Furthermore, quantum entanglement allows for correlated data representation across qubits, which can be exploited to capture intricate details and redundancies in video data more effectively than classical methods.
The experimental results reveal that QWT achieves a compression ratio of almost twice that of traditional WT for the same video, maintaining superior visual quality due to more efficient redundancy elimination.
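The localization property of wavelets mentioned above can be illustrated with a classical one-level Haar transform: detail coefficients are non-zero only where the signal changes, so small ones can be zeroed with little visible error. This is a sketch of the classical transform under assumed sample values and an assumed threshold; it is not the Quantum Wavelet Transform or the fusion model described in the abstract.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_forward(signal):
    """One-level Haar transform of an even-length list: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * _S for a, b in pairs]
    detail = [(a - b) * _S for a, b in pairs]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward by recombining each approximation/detail pair."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def threshold(coeffs, limit):
    """Zero every coefficient whose magnitude is below limit (lossy step)."""
    return [c if abs(c) >= limit else 0.0 for c in coeffs]


# Hypothetical row of pixel intensities with one local change.
signal = [10.0, 10.0, 10.0, 10.0, 50.0, 52.0, 10.0, 10.0]
approx, detail = haar_forward(signal)
restored = haar_inverse(approx, detail)               # lossless round trip
lossy = haar_inverse(approx, threshold(detail, 2.0))  # small details dropped
print(detail, lossy)
```

Only the pair containing 50 and 52 produces a non-zero detail coefficient, and dropping it replaces both samples with their mean while leaving the rest of the signal unchanged.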
- Research Article
1
- 10.1016/j.patcog.2025.112780
- May 1, 2026
- Pattern Recognition
DiffProtect: Generative adversarial examples using diffusion models for facial privacy protection