Articles published on Video Compression
3511 Search results
- Research Article
- 10.1016/j.eswa.2026.131360
- May 1, 2026
- Expert Systems with Applications
- Mohammad Ghasempour + 4 more
• Content-adaptive video encryption in the compressed domain
• Dynamically selects syntax elements based on video content complexity
• Maintains full format compliance using Adaptive Syntax Integrity (ASI)
• Tunable parameters balance encryption strength and bitrate increase

With the ever-increasing amount of digital video content, efficient encryption is crucial to protect visual content across diverse platforms. Existing methods often incur excessive bitrate overhead due to content variability. Furthermore, since most videos are already compressed, encryption in the compressed domain is essential to avoid processing overhead and re-compression quality loss. However, achieving both format compliance and compression efficiency while ensuring that the decoded content remains unrecognizable is challenging in the compressed domain, since only limited information is available without full decoding. This paper proposes an adaptive compressed domain video encryption (ACDC) method that dynamically adjusts the encryption strategy according to content characteristics. Two tunable parameters derived from the bitstream information enable adaptation to various application requirements. An adaptive syntax integrity method is employed to produce format-compliant bitstreams without full decoding. Experimental results show that ACDC reduces bitrate overhead by 48.2% and achieves a 31-fold speedup in encryption time compared to the latest state of the art, while producing visually unrecognizable outputs.
- Research Article
- 10.54254/2977-3903/2026.32901
- Apr 16, 2026
- Advances in Engineering Innovation
- Hai Li
With the explosive growth of multimedia data, traditional block-based hybrid coding frameworks (such as HEVC and VVC) face severe information loss during transformation, quantization, and entropy coding, approaching theoretical compression bottlenecks. Recently, the integration of deep learning has triggered a paradigm shift in image and video compression, particularly through new architectures based on frequency-domain processing and end-to-end optimization. This paper reviews recent advances in deep learning-assisted frequency-domain sampling reconstruction and end-to-end quantized coding. First, we trace the evolution from the traditional Discrete Cosine Transform (DCT) to content-adaptive intelligent frequency-domain sampling, and analyze strategies for generating sparse sampling patterns based on semantic importance. Second, we examine reconstruction networks using hybrid Transformer-CNN architectures, discussing their advantages for high-fidelity recovery of full-frequency coefficients and the trade-offs across various model designs. Furthermore, we analyze the synergistic mechanisms of differentiable quantizers and autoregressive context-based entropy coding within end-to-end rate-distortion optimization. Comprehensive evaluations on standard datasets such as Kodak and CLIC indicate that end-to-end frameworks integrating intelligent frequency sampling and hybrid reconstruction represent the most efficient current technical approach. Compared to traditional VVC encoders and early deep learning schemes, these methods achieve significant BD-rate gains (approximately 5%–12%) at identical bitrates, effectively preserving texture details and edge structures, especially under high compression ratios where traditional methods suffer from artifacts. Finally, we outline future directions, including computational complexity optimization, extension to general video coding, and hardware-friendly deployment.
- Research Article
- 10.1109/tip.2026.3682128
- Apr 14, 2026
- IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
- Shuyun Wang + 6 more
Current compressed video super-resolution methods have achieved promising performance, but they often assume that an input video is compressed under low-delay configurations. However, under random access configurations, those methods might struggle to leverage the metadata effectively due to the large variations of metadata in different compression configurations. In this work, we propose a Compression-Oriented Video Super-Resolution (COVSR) method that can address video super-resolution for both low-delay and random-access configurations. Specifically, we first introduce an efficient compression-aware propagation (ECAP) module that dynamically adjusts propagation routes in accordance with the compression configurations. Since existing methods require reconstructing frames in a frame-by-frame manner, it is difficult to achieve efficient parallelization. However, we find that by slightly relaxing sequential dependencies, our ECAP can significantly improve inference speed. Furthermore, existing methods typically perform alignment between adjacent frames or adjacent features. However, since ECAP may propagate features along non-adjacent reference routes, it introduces new challenges for accurate cross-frame feature alignment. In response, we propose a metadata-driven alignment (MDA) module that refines cross-frame motion vectors into dense, feature-level flow offsets, enabling precise alignment across temporally distant features. Extensive experimental results demonstrate that our COVSR not only achieves efficient and superior super-resolution performance but also is generalizable to various compression configurations. Our code will be available and the project page is at https://covsr.github.io.
- Research Article
- 10.3390/info17040366
- Apr 13, 2026
- Information
- Marek Domański + 2 more
Modern video compression is implemented in complex software systems that reuse software modules from various sources. This is particularly evident in experimental software systems designed for researching and standardizing new compression technologies. These systems often incorporate software modules operating in different color spaces. For example, AI-based techniques are often used in video coding experiments. The corresponding software modules often operate on RGB representations, while other modules operate on YCbCr components. In this study, we demonstrate that the quality loss resulting from color transformations is comparable to the respective quantization noise. Consecutive cycles of color transformations do not result in significant additional degradation. However, for image compression, very different results are obtained in different color representations. This aspect must be carefully considered in compression research. This paper supports these considerations with extensive experimental results in the context of ITU Recommendations BT.709 and BT.2020, as well as AVC and HEVC compression.
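To make the notion of repeated color-transformation cycles concrete, the following is a minimal sketch of an RGB to YCbCr round trip using the BT.709 luma coefficients. It is not the authors' experimental setup: full-range 8-bit signals, rounding at every step, and the helper names `rgb_to_ycbcr`, `ycbcr_to_rgb`, `quantize`, and `round_trip` are all assumptions made for illustration.

```python
# BT.709 luma coefficients (Kr, Kg, Kb).
# Assumption: full-range 8-bit signals with chroma centred on 128.
KR, KG, KB = 0.2126, 0.7152, 0.0722


def rgb_to_ycbcr(r, g, b):
    """Convert full-range RGB in [0, 255] to YCbCr (floating point)."""
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2.0 * (1.0 - KB)) + 128.0
    cr = (r - y) / (2.0 * (1.0 - KR)) + 128.0
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Invert rgb_to_ycbcr (floating point, no clipping)."""
    r = y + 2.0 * (1.0 - KR) * (cr - 128.0)
    b = y + 2.0 * (1.0 - KB) * (cb - 128.0)
    g = (y - KR * r - KB * b) / KG
    return r, g, b


def quantize(values):
    """Round each component to an integer and clip it to the 8-bit range."""
    return tuple(min(255, max(0, round(v))) for v in values)


def round_trip(rgb, cycles=1):
    """Apply `cycles` RGB -> YCbCr -> RGB conversions, rounding to 8 bits each time."""
    for _ in range(cycles):
        rgb = quantize(ycbcr_to_rgb(*quantize(rgb_to_ycbcr(*rgb))))
    return rgb
```

Calling `round_trip((10, 200, 77), cycles=5)` and comparing the result with the input gives a per-pixel view of the rounding loss; a real study would measure this over whole sequences and against codec quantization noise, which this sketch does not attempt.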
- Research Article
- 10.25258/ijddt.16.7s.72
- Apr 11, 2026
- International Journal of Drug Delivery Technology
- Dr Vipparthy Bhagya Raju + 2 more
Human Action Recognition (HAR) from video streams has many possible uses in areas like healthcare, surveillance, and human-computer interaction. Video compression methods such as SPIHT were originally designed around pixel-level quality measures like PSNR and SSIM, which do not reflect recognition performance. In this paper, we present a Task-Aware Progressive SPIHT Framework that prioritises spatio-temporal data critical to actions during compression. By combining efficient pose estimation algorithms with lightweight motion and posture cues from optical-flow magnitude maps, we build a significance mask that marks the areas most important for understanding the action. We present a 3D Temporal-Priority SPIHT method that utilises motion-based dependencies among video frames, alongside spatial and temporal dependencies. Additionally, a Policy-Gradient-based Bit-Dropping method and Weighted Significance Testing are used to dynamically allocate bits to coefficients that are more important for the skeleton and motion while suppressing unimportant background information. Experimental tests show that the proposed framework works well for video analytics applications that must run in real time with limited resources. It greatly improves action detection accuracy at low bitrates while keeping compression efficiency competitive.
- Research Article
- 10.1016/j.displa.2025.103333
- Apr 1, 2026
- Displays
- Jiajia Wang + 4 more
MSTF-Net: A Multi-Scale Transformer and Frequency-Spatial Fusion Network for compressed video frame quality enhancement (ChinaMM)
- Research Article
- 10.1016/j.patcog.2025.112696
- Apr 1, 2026
- Pattern Recognition
- Jiangwan Zhou + 4 more
Efficient motion-centric CLIP for compressed video action recognition
- Research Article
- 10.1007/s44336-026-00035-2
- Mar 23, 2026
- Vicinagearth
- Xiangyu Chen + 5 more
Can a video be compressed at an extreme compression rate as low as 0.01%? To this end, we achieve a compression rate of 0.02% in some cases by introducing Generative Video Compression (GVC), a new framework that redefines the limits of video compression by leveraging modern generative video models to achieve extreme compression rates while preserving a perception-centric, task-oriented communication paradigm, corresponding to Level C of the Shannon–Weaver model. Moreover, how can computation be traded for compression rate or bandwidth? GVC answers this question by shifting the burden from transmission to inference: it encodes video into extremely compact representations and delegates content reconstruction to the receiver, where powerful generative priors synthesize high-quality video from minimal transmitted information. Is GVC practical and deployable? To ensure practical deployment, we propose a compression–computation trade-off strategy, enabling fast inference on consumer-grade GPUs. Within the AI Flow framework, GVC opens new possibilities for video communication in bandwidth- and resource-constrained environments such as emergency rescue, remote surveillance, and mobile edge computing. Through empirical validation, we demonstrate that GVC offers a viable path toward a new, effective, efficient, scalable, and practical video communication paradigm.
- Research Article
- 10.55041/isjem05816
- Mar 23, 2026
- International Scientific Journal of Engineering & Management
- Vidya Sampat Gadhave + 1 more
The rapid proliferation of hyper-realistic, AI-generated "deepfake" videos has created significant societal risks, from political disinformation to identity fraud. Current detection methodologies, primarily based on Convolutional Neural Networks (CNNs), struggle to generalize across different forgery methods and are vulnerable to post-processing compression. This paper proposes a novel framework, Hybrid Spatial-Temporal Transformer (HST-Trans), designed to overcome these limitations. The HST-Trans architecture integrates an EfficientNetV2 backbone for capturing micro-level spatial anomalies (like skin texture inconsistencies) with a Vision Transformer (ViT) to model macro-level global dependencies and temporal flickering. Our evaluation on the FaceForensics++ and Celeb-DF v2 datasets demonstrates that this hybrid approach achieves a state-of-the-art accuracy of 98.4% and shows significantly improved robustness against video compression compared to pure CNN models. This research provides a critical step toward reliable, "in-the-wild" deepfake detection.

Keywords: Deepfake Detection, Facial Manipulation, Hybrid Deep Learning, Vision Transformers, Generalization Gap, Digital Forensics.
- Research Article
- 10.3390/electronics15061323
- Mar 22, 2026
- Electronics
- Udara Jayasinghe + 1 more
Reliable video transmission over error-prone channels remains a significant challenge due to the inherent trade-off between compression efficiency and noise resilience in conventional systems. To address these issues, this paper introduces a novel quantum Fourier transform (QFT)-based framework that integrates video compression and transmission within a unified quantum frequency-domain representation. The framework converts video data into a classical bitstream and maps it onto multi-qubit quantum states with variable encoding sizes (n), enabling flexible control over compression levels. Through the application of the QFT, these states are transformed into the frequency domain, where only selected coefficients are transmitted to reduce bandwidth requirements. At the receiver, the transmitted components are used to reconstruct the full representation, followed by inverse transformation and decoding to recover the video sequence. The performance of the proposed framework is evaluated using peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and video multi-method assessment fusion (VMAF). The results demonstrate that increasing the number of qubits enables exponential compression, achieving ratios up to 2^n:1, while maintaining high reconstruction quality under ideal transmission conditions. However, higher-qubit configurations exhibit increased sensitivity to channel noise, leading to a more rapid degradation as the signal-to-noise ratio decreases. In contrast, lower-qubit configurations provide improved robustness, maintaining more stable reconstruction quality under noisy conditions, albeit with reduced compression efficiency. Among the evaluated configurations, the two-qubit system achieves an effective trade-off, providing a compression ratio of 4:1 while maintaining strong visual and structural fidelity along with enhanced resilience to channel impairments.
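The relationship between qubit count and compression ratio, together with the PSNR metric named in the abstract, can be sketched in a few lines. This is only an illustration: the ratio is read as 2 to the power n (consistent with the 4:1 figure reported for two qubits), PSNR uses its standard definition with an assumed 8-bit peak, and the function names are hypothetical rather than taken from the paper.

```python
import math


def compression_ratio(n_qubits):
    """Upper-bound ratio for an n-qubit encoding, read as 2^n : 1 (assumption)."""
    return 2 ** n_qubits


def psnr(reference, reconstructed, peak=255.0):
    """Standard PSNR in dB between two equal-length sample sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# Ratios for the first few qubit counts; two qubits give the 4:1 case.
ratios = {n: compression_ratio(n) for n in range(1, 5)}
```

The sketch says nothing about how noise sensitivity grows with n; that behaviour comes from the paper's channel experiments and is not modelled here.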
- Research Article
- 10.1007/s11263-026-02787-2
- Mar 10, 2026
- International Journal of Computer Vision
- Tao Wu + 5 more
CompViT: Real-Time Compressed Video Action Recognition with Asymmetric Transformer Networks
- Research Article
- 10.1016/j.patcog.2026.113455
- Mar 1, 2026
- Pattern Recognition
- Dengpan Ye + 7 more
Multi-View Facial Forgery Detection for Compressed Videos Using Metric Learning and Attention Transfer
- Research Article
- 10.3390/s26051522
- Feb 28, 2026
- Sensors (Basel, Switzerland)
- Keren He + 4 more
Down-sampling-based video compression frameworks have shown great potential in improving compression efficiency in modern sensing and imaging systems. However, existing methods ignore critical spatial and temporal redundancy, and treat all frames uniformly during down-sampling. This leads to the loss of important information and impacts compression efficiency. To address these limitations, this paper proposes a temporal down-sampling system, in which only intermediate frames are down-sampled while preserving key frames with high quality for reference. On the decoding side, we employ a frame-recurrent enhancement mechanism to maximize the use of temporal redundancy information. For the fusion stage of enhancement, we design a Multi-scale Temporal-Spatial Attention (MTSA) module. MTSA consists of two components: Multi-Temporal Attention (MTA) and Pyramid Spatial Attention (PSA). MTA performs multi-scale temporal correlation modeling, expanding the receptive field and providing stable cues in compressed regions. PSA integrates local spatial saliency and contextual structure in a progressive and multi-stage manner. Extensive experiments show that our approach achieves consistent BD-rate reductions. Under All-Intra, Low-Delay-P, and Random Access configurations, we observe BD-rate reductions for I, P, and B frames ranging from 14% to 39% compared to VVC, and our approach outperforms prior approaches anchored on the HEVC standard.
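The idea of down-sampling only intermediate frames while keeping key frames at full resolution can be illustrated as follows. This is a hedged sketch, not the paper's pipeline: the fixed `key_interval`, the 2x2 averaging filter, the list-of-rows frame format, and both function names are assumptions chosen for brevity.

```python
def downsample_2x(frame):
    """Halve width and height by averaging each 2x2 block.

    `frame` is a list of rows of numbers with even width and height.
    """
    return [
        [
            (frame[y][x] + frame[y][x + 1] + frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
            for x in range(0, len(frame[0]), 2)
        ]
        for y in range(0, len(frame), 2)
    ]


def temporal_downsample(frames, key_interval=4):
    """Keep every `key_interval`-th frame intact and down-sample the rest.

    Returns a list of (label, frame) pairs, where label is "key" or "intermediate".
    """
    output = []
    for index, frame in enumerate(frames):
        if index % key_interval == 0:
            output.append(("key", frame))  # full-resolution reference frame
        else:
            output.append(("intermediate", downsample_2x(frame)))
    return output
```

A decoder following the abstract would then up-sample and enhance the intermediate frames using the full-resolution key frames as references; that enhancement network is outside the scope of this sketch.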
- Research Article
- 10.1364/oe.581953
- Feb 24, 2026
- Optics express
- Ningchi Li + 9 more
This paper proposes an end-to-end, multi-domain joint compression method for 3D light field video based on a viewpoint-disparity representation. By compressing dense viewpoints into sparse viewpoints with associated disparity and establishing a closed-loop "motion vector → disparity → view synthesis" pathway, our method achieves an 81% BD-rate reduction and a 1.998 dB BD-PSNR improvement compared to the MV-HEVC standard. Furthermore, the approach successfully decouples decoding time from the number of viewpoints, maintaining a stable latency of 28 ms during 96-viewpoint rendering. This work provides an effective solution for efficient compression of dense 3D light field video while establishing a theoretical foundation for its real-time transmission.
- Research Article
- 10.34190/iccws.21.1.4545
- Feb 19, 2026
- International Conference on Cyber Warfare and Security
- Arif Ullah + 4 more
This study investigates the detection of deepfake images and videos on social media platforms such as Instagram for forensic analysis using hybrid-learning approaches. It highlights the critical importance of safeguarding privacy and authenticity in digital media. The background draws attention to the growing threat posed by deepfakes, which pose significant challenges across multiple domains, such as politics and entertainment. Existing methods often depend on visual features specific to a dataset and struggle to generalize across different manipulation techniques. Moreover, most approaches focus exclusively on either temporal or spatial features, which limits their capacity to identify complex anomalies involving fused facial features like the mouth, nose and eyes. Important solutions to these challenges include Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and hybrid architectures that simultaneously capture spatial and temporal information in deepfake content, such as Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM), Gated Recurrent Unit (GRU) and Vision Transformers (ViT). Additionally, this paper introduces a novel combination of artifact inspection and facial landmark recognition to enhance detection accuracy and employs Gated Recurrent Units (GRUs) and Vision Transformers (ViT) for data augmentation, thereby improving model robustness. The effectiveness of the proposed approach is validated through experiments demonstrating substantially improved detection accuracy, with improvement exceeding 1.5% across multiple datasets. However, several challenges remain, including limited robustness to noise, difficulty in detecting deepfakes in compressed video formats, and dataset imbalance issues. The proposed enhanced hybrid model exhibits superior detection performance while maintaining adaptability across multiple datasets. Future research will focus on strengthening model generalization to effectively counter emerging deepfake generation techniques.
- Research Article
- 10.1007/s11042-026-21231-8
- Feb 13, 2026
- Multimedia Tools and Applications
- Rosemarie Anton + 2 more
Joint video compression and encryption algorithm for real-time secure transmission using quadratic chaotic map
- Research Article
- 10.15622/ia.25.1.4
- Feb 4, 2026
- Информатика и автоматизация
- Aishwarya Rajeev + 1 more
Deepfake detection continues to pose significant challenges, primarily because existing methods often suffer from key limitations, including reliance on individual frame analysis, vulnerability to low-resolution or compressed videos, and inability to capture temporal inconsistencies. Furthermore, traditional face detection techniques frequently fail under challenging conditions such as poor lighting or occlusion, while many models struggle with subtle manipulations due to inadequate feature extraction and overfitting on limited datasets. To address the drawbacks of existing deepfake detection approaches, this research proposes a Face and Motion-Aware Detection Framework that integrates both spatial and temporal information. The framework begins with a preprocessing stage that extracts video frames at a fixed rate to ensure temporal consistency. Facial regions and detailed landmarks are accurately detected using BlazeFace and MediaPipe Face Mesh. These features are then processed by the proposed XceptionCapsule Net, which combines the spatial feature extraction capabilities of the Xception model with the hierarchical and viewpoint-aware representation of Capsule Networks (CapsNet), and the temporal dependency modeling power of a Bidirectional Long Short-Term Memory (BiLSTM) layer. The architecture incorporates Global Average Pooling, Flatten, and fully connected layers, with Sigmoid activation for binary classification. Extensive evaluations on the FaceForensics++ (FF++) and Celeb-DF datasets demonstrate strong performance, achieving up to 99.31% accuracy and 99.99% Area Under the Curve (AUC). The results validate the framework’s effectiveness, precision, and generalization across various video qualities and manipulation scenarios.
- Research Article
- 10.1016/j.knosys.2025.115215
- Feb 1, 2026
- Knowledge-Based Systems
- Yilei Chen
ZRENet: A 3D-guided generative framework for zero-shot restoration of highly compressed talking face videos
- Research Article
- 10.1016/j.dsp.2025.105799
- Feb 1, 2026
- Digital Signal Processing
- Quanxu Zhao + 4 more
Self-attention based motion error correction for neural video compression
- Research Article
- 10.1109/tpami.2025.3625063
- Feb 1, 2026
- IEEE transactions on pattern analysis and machine intelligence
- Yuan Tian + 5 more
Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bits and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC into an advanced SMC++ model in several respects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs on three video analysis tasks and seven datasets.