
Articles published on Learning-Based Video

129 search results, sorted by recency
  • Research Article
  • 10.3791/69299
Behavioral Engagement Assessment in University Classrooms via Deep Learning-based Video Object Detection.
  • Mar 17, 2026
  • Journal of visualized experiments : JoVE
  • Miaomiao Feng + 1 more

This study aims to assess students' learning engagement in university classrooms using deep learning-based video object detection. First, correlation analysis identified seven classroom behaviors that are strongly positively correlated with learning engagement, and these were adopted as indicators for measuring it. Next, 30 synchronized videos of real classroom teaching were collected from 6 classes at Shandong University of Science and Technology (SDUST) and divided into a training set and a test set. After the seven behaviors were manually annotated in the training data, a machine learning algorithm was trained on this set in a supervised manner; once trained, the model generated initial annotations for the remaining unlabeled data. To achieve more accurate and efficient classroom behavior recognition, two representative algorithms, Faster R-CNN and YOLOv5s, were compared in behavior detection experiments, and based on their accuracy and time cost, YOLOv5s was selected for classroom behavior detection. Finally, the focus group method was used to assign a score to each behavior and develop a three-level learning engagement scoring model. Using automatically measured behavioral data, the model enables real-time, automatic assessment of learning engagement at both the individual and class levels.

  • Research Article
  • 10.1007/s00521-026-11949-9
Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition.
  • Mar 1, 2026
  • Neural computing & applications
  • Jian Sun + 1 more

Video quality significantly affects video classification. We encountered this problem when classifying Mild Cognitive Impairment: accuracy was high on clear videos but lower on blurred ones. This led us to realize that incorporating Video Quality Assessment (VQA) may improve video classification. To this end, this paper proposes SSL-V3, a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification. SSL-V3 uses a Combined-SSL mechanism to integrate VQA into video classification and to address the shortage of VQA labels, which is common in video datasets and makes it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL uses the video quality score as a factor that directly tunes the feature map of the video classifier. The score then serves as the link between VQA and classification, so that the supervised classification task tunes the parameters of the VQA component. SSL-V3 achieved robust results on two datasets; for example, it reached an accuracy of 94.87% on a subset of interview videos in I-CONECT (a healthcare dataset involving facial videos), verifying SSL-V3's effectiveness.

  • Research Article
  • 10.3390/s26010321
Current-Aware Temporal Fusion with Input-Adaptive Heterogeneous Mixture-of-Experts for Video Deblurring
  • Jan 4, 2026
  • Sensors (Basel, Switzerland)
  • Yanwen Zhang + 2 more

In image sensing, measurements such as an object’s position or contour are typically obtained by analyzing digitized images. This method is widely used due to its simplicity. However, relative motion or inaccurate focus can cause motion and defocus blur, reducing measurement accuracy. Thus, video deblurring is essential. However, existing deep learning-based video deblurring methods struggle to balance high-quality deblurring, fast inference, and wide applicability. First, we propose a Current-Aware Temporal Fusion (CATF) framework, which focuses on the current frame in terms of both network architecture and modules. This reduces interference from unrelated features of neighboring frames and fully exploits current frame information, improving deblurring quality. Second, we introduce a Mixture-of-Experts module based on NAFBlocks (MoNAF), which adaptively selects expert structures according to the input features, reducing inference time. Third, we design a training strategy to support both sequential and temporally parallel inference. In sequential deblurring, we conduct experiments on the DVD, GoPro, and BSD datasets. Qualitative results show that our method effectively preserves image structures and fine details. Quantitative results further demonstrate that our method achieves clear advantages in terms of PSNR and SSIM. In particular, under the exposure setting of 3 ms–24 ms on the BSD dataset, our method achieves 33.09 dB PSNR and 0.9453 SSIM, indicating its effectiveness even in severely blurred scenarios. Meanwhile, our method achieves a good balance between deblurring quality and runtime efficiency. Moreover, the framework exhibits minimal error accumulation and performs effectively in temporal parallel computation. These results demonstrate that effective video deblurring serves as an important supporting technology for accurate image sensing.
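The PSNR figure quoted above follows the standard definition PSNR = 10·log10(MAX² / MSE), where MSE is the mean squared error between the restored and ground-truth frames and MAX is the peak pixel value. The sketch below is a minimal illustration of that formula only, assuming 8-bit images (MAX = 255) flattened into plain Python lists. It is not the authors' evaluation code, and the abstract does not specify details such as color space or border cropping, which papers often handle differently. SSIM is not covered here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are given as flat sequences of pixel values (an assumption made
    to keep the sketch dependency-free).
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Usage: a restored frame that is off by one grey level at every pixel (MSE = 1).
ground_truth = [10, 20, 30, 40]
deblurred = [11, 21, 31, 41]
print(f"{psnr(ground_truth, deblurred):.2f} dB")  # prints "48.13 dB"
```

Because PSNR is logarithmic in MSE, halving the MSE adds about 3 dB. Small-looking PSNR gaps between deblurring methods can therefore correspond to noticeable error reductions.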

  • Research Article
  • 10.1016/j.aei.2025.103903
Deep internal learning-based video compressive sensing for the identification of high-frequency structural dynamic characteristics using full-field vision methods
  • Jan 1, 2026
  • Advanced Engineering Informatics
  • Junying Wang + 4 more

  • Research Article
  • 10.1109/access.2026.3670354
Simulated Artifacts and Data Augmentation for Real-World Video Motion Deblurring
  • Jan 1, 2026
  • IEEE Access
  • Sota Moriyama + 2 more

This paper proposes a data augmentation method that simulates artifacts specific to real-world videos, as a preprocessing step for applying deep learning-based video deblurring to real-world footage. Conventional deep learning-based video deblurring methods generalize poorly: even when a method achieves high accuracy on a test set from the same domain as its training data, its accuracy drops on real-world test videos. We therefore hypothesize that real-world videos contain compression noise and image processing artifacts that are absent from deblurring training datasets. We introduce a data augmentation method that applies transformations simulating these real-world degradations during training. In this study, we prepare a real-world test dataset without ground-truth videos, using footage captured with a commercially available smartphone. We then aim to improve deblurring accuracy on real-world videos by running inference with a model trained using our data augmentation method.

  • Research Article
  • 10.34139/jscs.2025.15.4.67
A Study on Efficient Video Recognition AI Mimicking the Brain's ‘Selection and Concentration’ Principle
  • Dec 31, 2025
  • Society for Standards Certification and Safety
  • Jungyung Kim + 1 more

Recent deep learning-based video recognition technologies, driven by advances in deep neural networks such as 3D CNNs, have achieved superhuman accuracy. However, the growing scale of these models has led to massive computational costs and power consumption. Furthermore, the "black-box" nature of their complex inference processes limits their application in high-reliability fields such as autonomous driving and healthcare. To overcome these limitations, this study proposes a novel video recognition model that achieves both intelligent efficiency and explainability by applying the biological brain's principles of 'Selective Attention' and 'Functional Specialization' to deep neural network design. We first replicated the study by Hiramoto & Cline (2024) and conducted an in-depth analysis of neural data from the optic tectum using unsupervised learning techniques. This analysis statistically verified that neurons differentiate into 'expert groups' that respond only to specific spatiotemporal patterns, such as static backgrounds, horizontal movements, or complex rotations. To implement these biomimetic principles, we designed the Spatially Adaptive MovieNet, which actively selects the optimal computational path by analyzing the dynamic complexity of input videos in real time. Its core Intelligent Gating Module detects high-information regions within the video and uses a Winner-Takes-All mechanism to physically execute only one computational path, either 2D (static) or 3D (dynamic), rather than computing a probability-based weighted sum of both, thereby achieving substantial acceleration.
Furthermore, a multi-objective learning strategy including a Sparsity Loss was established; by enforcing sparsity in the attention maps, it induces the model to focus on motion, the key feature of the data. Comparative experiments with a standard 3D CNN showed that the proposed model matched its top classification accuracy of 98.84%. At the same time, it reduced the number of parameters by approximately 98% (from 7.35M to 0.14M) and the computational cost from 0.27 GFLOPs to 0.22 GFLOPs. Notably, the proposed model showed adaptive behavior by selecting lighter computational modes for simple data on its own. Through the sparsity loss, it also clearly presented the rationale for its decisions by accurately visualizing the subject's motion trajectories. By integrating neuroscientific insights into deep learning architectures to demonstrate intelligent efficiency, this study points to a significant direction for future lightweight AI research on low-power edge devices.
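The abstract contrasts Winner-Takes-All routing with a probability-based weighted sum. A toy example makes the compute difference concrete. The sketch below illustrates the idea under stated assumptions and does not reproduce the Spatially Adaptive MovieNet. The two branches are hypothetical placeholders (`branch_2d`, `branch_3d`). The gate scores come from a hand-written motion heuristic, the mean absolute frame difference compared against a hypothetical `threshold`, instead of the learned Intelligent Gating Module. Call counters show that soft gating executes both branches, while hard gating executes only one.

```python
import math
from typing import Dict, List

Clip = List[List[float]]  # a clip is a list of frames; each frame is a flat list of pixels

# Execution counters so the compute difference between the two gates is visible.
calls: Dict[str, int] = {"2d": 0, "3d": 0}


def branch_2d(clip: Clip) -> float:
    """Hypothetical cheap 'static' branch: looks only at the last frame."""
    calls["2d"] += 1
    return sum(clip[-1]) / len(clip[-1])


def branch_3d(clip: Clip) -> float:
    """Hypothetical expensive 'dynamic' branch: looks at every frame."""
    calls["3d"] += 1
    return sum(sum(frame) for frame in clip) / sum(len(frame) for frame in clip)


def motion_score(clip: Clip) -> float:
    """Mean absolute difference between consecutive frames (a crude motion proxy)."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(clip, clip[1:])
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


def gate_logits(clip: Clip, threshold: float = 10.0) -> List[float]:
    """Scores for [static, dynamic]; `threshold` is a hypothetical tuning value."""
    m = motion_score(clip)
    return [threshold - m, m - threshold]


def soft_gate(clip: Clip) -> float:
    """Probability-weighted sum: both branches always execute."""
    logits = gate_logits(clip)
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # numerically stable softmax
    total = sum(exps)
    w_static, w_dynamic = [e / total for e in exps]
    return w_static * branch_2d(clip) + w_dynamic * branch_3d(clip)


def hard_gate(clip: Clip) -> float:
    """Winner-Takes-All: only the higher-scoring branch executes."""
    static_score, dynamic_score = gate_logits(clip)
    return branch_2d(clip) if static_score >= dynamic_score else branch_3d(clip)


static_clip = [[5.0] * 4 for _ in range(3)]
moving_clip = [[0.0] * 4, [50.0] * 4, [0.0] * 4]
hard_gate(static_clip)  # runs only the 2D branch
hard_gate(moving_clip)  # runs only the 3D branch
soft_gate(moving_clip)  # runs both branches
print(calls)  # prints {'2d': 2, '3d': 2}
```

A hard argmax is not differentiable. Networks that route this way during training therefore usually need a workaround such as a straight-through estimator or a Gumbel-softmax relaxation. The abstract does not say which approach, if any, the authors use.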

  • Research Article
  • Cited by 3
  • 10.1007/s10791-025-09834-5
Application of deep learning-based video AI analysis in coal mine safety monitoring
  • Dec 8, 2025
  • Discover Computing
  • Gang Guo + 7 more

Conventional safety monitoring methods are increasingly inadequate for the complex conditions of modern coal mines. This study introduces a safety monitoring framework based on an enhanced YOLO model, specifically adapted for underground environments. Improvements include optimized anchor box design with K-means clustering, which reduces detection errors and improves localization accuracy. Evaluations on benchmark datasets demonstrate superior results, with mAP scores of 0.82 on UCF101, 0.85 on MS COCO, and 0.80 on a coal mine video dataset. When integrated with ConvLSTM, the system achieves higher accuracy in miner behavior recognition, while the incorporation of sensor data enables precise prediction of gas concentration, temperature, and humidity. Additionally, the decision-making module provides reliable early warnings of hazards such as gas leaks, fire, and unsafe behaviors, achieving the highest detection accuracy and an average response time of only 3 s. The proposed system enhances detection performance, robustness, and real-time responsiveness, offering strong support for coal mine safety management.
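The abstract does not describe its K-means anchor procedure. A common formulation, introduced with YOLOv2, clusters the widths and heights of ground-truth boxes using the distance d = 1 − IoU. Under Euclidean distance, large boxes would dominate the clustering; the IoU-based distance avoids this. The sketch below implements that standard variant in plain Python, assuming boxes are given as (width, height) pairs in pixels. The cluster count `k`, the iteration cap and the seed are illustrative parameters, not values from the paper. Centroids are updated with the mean, as in the usual formulation; some implementations use the median instead.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float]  # (width, height) of a ground-truth box, in pixels


def iou_wh(a: Box, b: Box) -> float:
    """IoU of two boxes compared by shape only (both anchored at the same corner)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


def kmeans_anchors(boxes: Sequence[Box], k: int, iters: int = 100, seed: int = 0) -> List[Box]:
    """K-means on box shapes with distance 1 - IoU; returns anchors sorted by area."""
    rng = random.Random(seed)
    centroids = [tuple(c) for c in rng.sample(list(boxes), k)]
    for _ in range(iters):
        clusters: List[List[Box]] = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            nearest = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[nearest].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append((sum(b[0] for b in members) / len(members),
                                sum(b[1] for b in members) / len(members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments have stabilised
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Toy data: a group of small boxes and a group of tall, large boxes.
boxes = [(9, 10), (10, 11), (11, 9), (10, 10), (95, 190), (100, 200), (105, 210), (100, 200)]
print(kmeans_anchors(boxes, k=2))  # expected: one anchor per group
```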

  • Research Article
  • Cited by 2
  • 10.1007/s10845-025-02720-3
Automatic melt pool tracking and segmentation in laser powder bed fusion using X-ray image sequence
  • Nov 11, 2025
  • Journal of Intelligent Manufacturing
  • Ruiyuan Zhang + 7 more

Laser Powder Bed Fusion is among the most widely used techniques for metal additive manufacturing. In this process, a laser melts metal powder onto a substrate, forming a melt pool. The solid-liquid interface of the melt pool plays a critical role in the cooling behavior, which in turn affects the microstructure and mechanical properties of the printed part. High-speed X-ray imaging enables real-time observation of subsurface melt pool dynamics. However, accurately segmenting the melt pool from X-ray images remains challenging due to high noise levels and low contrast. Efficient data processing methods for this task are still underdeveloped. Researchers often rely on manual image masking or basic image processing techniques for object segmentation, which are either labor-intensive or lack sufficient accuracy and robustness. This study introduces a deep learning-based video object segmentation model that automatically tracks and segments the melt pool, thereby determining the solid-liquid interface in X-ray image sequences. The model is semi-supervised and highly efficient, requiring manual image masking only for the first frame to predict segmentations in subsequent frames. It incorporates spatiotemporal attention modules to capture correlations within the image sequence effectively. Specifically, a co-attention module extracts temporal features from the previous frame, while attention blocks highlight key regions in the current frame. Experimental results show that integrating attention mechanisms significantly improves segmentation accuracy compared to state-of-the-art methods.

  • Research Article
  • Cited by 1
  • 10.1111/hel.70078
Real-Time Prediction of Helicobacter pylori Infection Using a Deep Learning Model During Esophagogastroduodenoscopy: A Prospective Multicenter Study.
  • Sep 1, 2025
  • Helicobacter
  • Li Yan-Dong + 8 more

Real-time assessment of Helicobacter pylori infection during esophagogastroduodenoscopy (EGD) is clinically valuable but remains technically challenging. We developed a deep learning-based system to predict H. pylori infection directly from EGD videos. This prospective multicenter diagnostic study enrolled patients undergoing EGD at three hospitals between September and December 2024. All patients underwent the 14C-urea breath test as the reference standard. The model integrated deep learning-based video analysis to predict gastric regions with H. pylori infection in real time. The primary outcomes were diagnostic accuracy, sensitivity, and specificity. Secondary outcomes included the positive predictive value, negative predictive value, and area under the receiver operating characteristic curve (AUC). Logistic regression was used to explore factors associated with diagnostic performance. Among the cohort of 701 patients, 42.4% were positive for H. pylori infection. The model achieved an AUC of 0.918 (95% CI: 0.895-0.937), with an accuracy of 86.3% (95% CI: 83.5%-88.8%), sensitivity of 86.9% (95% CI: 82.5%-90.5%), and specificity of 85.9% (95% CI: 82.1%-89.1%). In multivariate analysis, mucosal atrophy was independently associated with increased diagnostic error (OR = 1.788, p = 0.014), while a higher examination quality score was protective (OR = 0.600, p < 0.001). This deep learning model demonstrated high diagnostic performance for real-time H. pylori detection during EGD across multiple centers and merits consideration as a means of improving the diagnostic efficiency and consistency of clinical endoscopy. Chinese Clinical Trial Registry registration number: ChiCTR2400088612.
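All of the primary outcomes and the first two secondary outcomes listed above come from a 2×2 confusion matrix that compares model predictions with the reference standard (here, the 14C-urea breath test). The sketch below shows those standard definitions. The counts in the usage line are made-up illustrative numbers, not data from the study. The reported 95% confidence intervals would require an additional interval method, which the abstract does not specify.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard diagnostic-accuracy measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate: infected patients flagged
        "specificity": tn / (tn + fp),  # true negative rate: uninfected patients cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Illustrative counts only (not from the study).
print(diagnostic_metrics(tp=80, fp=10, tn=90, fn=20))
```

Sensitivity and specificity do not depend on prevalence, but PPV and NPV do. The latter two therefore reflect the 42.4% positivity rate of this particular cohort.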

  • Research Article
  • Cited by 1
  • 10.1038/s41598-025-10397-0
Deep learning-based video analysis for automatically detecting penetration and aspiration in videofluoroscopic swallowing study
  • Jul 7, 2025
  • Scientific Reports
  • Soyoung Kwak + 5 more

The videofluoroscopic swallowing study (VFSS) is the gold standard for diagnosing dysphagia, but its interpretation is time-consuming and requires expertise. This study developed a deep learning model for automatically detecting penetration and aspiration in VFSS and assessed its diagnostic accuracy. Images corresponding to the highest and lowest positions of the hyoid bone were automatically extracted from VFSS videos. These positions represent, respectively, the moment of upper esophageal sphincter opening during the swallow and the pre-swallow and post-swallow phases. The extraction yielded a total of 18,145 images from 1,467 patients. The model was trained using a convolutional neural network architecture, with techniques to address class imbalance and optimize performance. The model achieved high diagnostic accuracy at the patient level, with area under the receiver operating characteristic curve (AUC) values of 0.935 (normal swallowing), 0.889 (penetration), and 0.845 (aspiration). However, despite strong performance in identifying normal swallowing, the model showed low sensitivity for detecting penetration and aspiration. The findings suggest that the proposed model may reduce interpretation time by minimizing the need for repeated video review to identify penetration or aspiration, enabling clinicians to focus on other clinically relevant VFSS findings. Future studies should address its limitations by analyzing full-frame VFSS data and incorporating multicenter datasets.
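Each per-class AUC above can be read as the probability that the model scores a randomly chosen positive case higher than a randomly chosen negative one. For aspiration, for example, a positive case is a patient with aspiration. This is the Mann-Whitney interpretation of the ROC AUC. The minimal sketch below computes AUC that way for one class against the rest, counting ties as one half; the scores and labels are toy values, not study data. AUC summarizes ranking quality across all thresholds, so it can be reasonably high even when sensitivity at a particular operating threshold is low. This is one way a model can show both the AUCs and the low sensitivities reported above.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the fraction of positive-negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy one-vs-rest example: label 1 = "aspiration", 0 = any other class.
scores = [0.9, 0.4, 0.5, 0.1]
labels = [1, 1, 0, 0]
print(roc_auc(scores, labels))  # 3 of 4 pairs ranked correctly -> 0.75

# Sensitivity at a single (hypothetical) threshold of 0.6 is lower than the AUC.
threshold = 0.6
sensitivity = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold) / labels.count(1)
print(sensitivity)  # 0.5
```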

  • Research Article
  • Cited by 1
  • 10.1007/s11548-025-03442-w
Dual-task meta-auxiliary learning in laparoscopic cholecystectomy.
  • Jun 26, 2025
  • International journal of computer assisted radiology and surgery
  • Rui Guo + 4 more

Artificial intelligence is transforming surgical practice by improving procedural quality and decision-making. Machine learning-based video analysis can reliably identify surgical milestones, enhancing contextual understanding for surgeons. This study proposes a novel framework for detecting the critical view of safety (CVS) in robot-assisted laparoscopic cholecystectomy (RLC) to improve procedural safety. We present a meta-auxiliary learning framework that combines milestone recognition and anatomical segmentation to enhance contextual awareness. The framework addresses label imbalance by facilitating knowledge sharing across tasks, ensuring balanced optimization. A curated RLC dataset was used to evaluate CVS identification and multi-instance segmentation performance. The proposed method achieved an F1 score of 78% for CVS detection and a mean IoU of 83.9% for anatomical segmentation, demonstrating its efficacy in complex surgical environments. This framework establishes a new paradigm for surgical video analysis by integrating milestone detection and segmentation. Its ability to enhance decision support and procedural review in RLC highlights its potential for broader adoption in clinical practice.
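The two metrics reported above have standard definitions. F1 is the harmonic mean of precision and recall. IoU is the overlap between a predicted mask and a ground-truth mask, divided by their union. The sketch below represents masks as sets of pixel coordinates and averages IoU over mask pairs. The abstract does not state how the authors averaged (per class, per instance or per frame), so that choice is an assumption. Scoring two empty masks as a perfect match is also an assumption.

```python
from typing import List, Set, Tuple

Mask = Set[Tuple[int, int]]  # a binary mask as the set of (row, col) foreground pixels


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks."""
    union = pred | target
    if not union:
        return 1.0  # assumption: two empty masks count as a perfect match
    return len(pred & target) / len(union)


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) mask pairs."""
    return sum(mask_iou(p, t) for p, t in pairs) / len(pairs)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy masks and counts, not data from the study.
predicted = {(0, 0), (0, 1)}
ground_truth = {(0, 1), (1, 1)}
print(mask_iou(predicted, ground_truth))  # 1 shared pixel out of 3 in the union
print(f1_score(tp=3, fp=1, fn=1))  # precision = recall = 0.75, so F1 = 0.75
```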

  • Research Article
  • Cited by 1
  • 10.3390/biology14070771
Integrating Deep Learning and Transcriptomics to Assess Livestock Aggression: A Scoping Review.
  • Jun 26, 2025
  • Biology
  • Roland Juhos + 3 more

Aggressive behavior in livestock creates major difficulties for animal welfare, farm safety, economic performance, and selective breeding. Two innovative tools, deep learning-based video analysis and transcriptomic profiling, have recently emerged to aid the understanding and monitoring of such behaviors. This scoping review assesses the current use of these two methods in aggression research across livestock species, identifies trends, and reveals unaddressed gaps in the existing literature. A scoping literature search of the PubMed, Scopus, and Web of Science databases identified articles published from 2014 to April 2025. The review included 268 original studies: 250 AI-driven behavioral phenotyping papers and 18 transcriptomic investigations, with no study combining both approaches. Most research focused on economically significant species such as pigs and cattle, while poultry, small ruminants, camels, fish, and other species received limited attention. The main developments include convolutional neural network (CNN)-based object detection and pose estimation systems, together with the transcriptomic identification of molecular pathways linked to aggression and stress. The main barriers to progress in the field include inconsistent behavioral annotation, insufficient real-farm validation, and limited cross-modal integration. We propose developing standardized behavior definitions, multimodal datasets, and integrated pipelines that link phenotypic and molecular data. These innovations should accelerate progress in livestock welfare, precision breeding, and sustainable animal production.

  • Research Article
  • 10.55041/isjem04127
Deep Learning for Video Summarization
  • Jun 7, 2025
  • International Scientific Journal of Engineering and Management
  • Sakshi Mohite

This project aims to develop a deep learning-based video summarization system that utilizes Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to analyze video content and generate concise summaries. The system will automatically identify key objects, events, and scenes in videos, and create summaries that capture the essential information. The project will explore various deep learning architectures and techniques to improve the quality and efficiency of video summarization.

  • Research Article
  • 10.1016/j.jort.2025.100890
Deep learning-based video analysis for visitor detection and tracking in protected areas
  • Jun 1, 2025
  • Journal of Outdoor Recreation and Tourism
  • Hugo Moreno + 2 more

  • Research Article
  • 10.47191/ijmra/v8-i05-67
The Effect of Using Problem-Based Learning-Based Video Media on Learning Outcomes of Elementary School Students
  • May 31, 2025
  • INTERNATIONAL JOURNAL OF MULTIDISCIPLINARY RESEARCH AND ANALYSIS
  • Adia Adia + 4 more

In the current learning process, innovative learning media are needed to improve students' understanding and learning outcomes. Civics is often considered boring because it contains a great deal of theoretical and abstract material. The purpose of this research is to determine whether the use of problem-based learning-based video media affects student learning outcomes. The research used observation, documentation, and tests. The findings indicate that students' low learning outcomes in Civics are thought to be caused by a lack of engaging learning methods and media. Using video media based on Problem-Based Learning (PBL) in Civics lessons to present real problems and stimulate students' critical thinking improves student learning outcomes.

  • Research Article
  • 10.1093/bjs/znaf092.022
Automatic Identification of Teamwork Behaviors in the Operating Room
  • May 16, 2025
  • British Journal of Surgery
  • L Schewski + 5 more

Background: Accurate identification of intraoperative behaviors is crucial for assessing surgical performance, improving patient outcomes, and supporting surgical training. Traditional methods for evaluating intraoperative behaviors rely on experts' on-site observations or assessments of video recordings. Although these methods have been shown to be reliable, they are time-consuming, prone to bias, and limited in scalability. Video recordings of the operating room (OR), combined with methodological advancements in computer vision and machine learning, offer promising opportunities for automated, objective, and scalable behavior analysis.
Aims: This study explores the feasibility of automated approaches for assessing teamwork-related intraoperative behaviors in the OR. In a stepwise approach, we aim to automatically: 1) detect the positions and poses of OR team members, 2) analyze movements and distribution patterns of the OR team, 3) determine their roles and functions, and 4) recognize structured team communication (e.g., team timeouts, briefings).
Methods: A multi-view OR dataset with over 100 hours of video recordings was created at a Swiss university hospital, featuring annotations of team interactions during real surgical procedures. Using deep learning-based video techniques, a multidisciplinary team of work psychologists, computer scientists, and surgeons detects and analyzes key events of interest.
Results: A framework for automatic video analysis was developed and validated using the created dataset. The experimental results show that our framework provides a valuable and efficient alternative to existing state-of-the-art approaches for both surgical role classification and team communication detection.
Conclusion: We present a novel pipeline that automatically classifies the roles of OR team members and detects behavioral team interactions. This work highlights the potential of automated approaches to revolutionize surgical practice and education by providing scalable, objective insights into non-technical skills.

  • Research Article
  • 10.53759/7669/jmc202505076
A Machine Learning-Based Video Compression for Effective Video Encoding and Transmission
  • Apr 5, 2025
  • Journal of Machine and Computing
  • Bairavel S + 5 more

Deep Learning (DL) is revolutionizing video processing as video becomes increasingly central to daily life. Encoding and transmitting video effectively becomes challenging as content resolution and data volumes grow rapidly. This research presents an advanced video compression (VC) method that uses DL to improve encoding and transmission efficiency, reflecting the need for more cutting-edge methods in digital media. This work uses advanced Machine Learning (ML) to reduce video data size without compromising video quality, making it better suited to high-definition streaming and videoconferencing. The algorithm combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) to improve video quality. The CNN captures complex spatial details within each video frame, while the recurrent part, a Long Short-Term Memory (LSTM) network, models dependencies across frames. The proposed VC achieves higher video quality than traditional codecs such as H.264 and H.265. It adapts in real time and optimizes bandwidth usage, making it useful for live streaming services and videoconferencing. The VC has been tested extensively, demonstrating significant bit rate reduction while maintaining excellent video quality. It surpasses modern compression methods, making it a flexible solution to the growing demand for high-quality video content. The authors expect this approach to transform digital media distribution.

  • Open Access
  • Research Article
  • Cited by 38
  • 10.1109/tcsvt.2022.3229079
DeepStream: Video Streaming Enhancements Using Compressed Deep Neural Networks
  • Apr 1, 2025
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Hadi Amirpour + 2 more

In HTTP Adaptive Streaming (HAS), each video is divided into smaller segments, and each segment is encoded at multiple predefined bitrates to construct a bitrate ladder. To optimize bitrate ladders, per-title encoding approaches encode each segment at various bitrates and resolutions to determine the convex hull. An optimized bitrate ladder is then constructed from the convex hull, increasing the Quality of Experience (QoE) for end users. As deep learning-based video enhancement approaches become ever more efficient, they are increasingly employed on the client side to increase QoE, especially when GPU capabilities are available. Therefore, scalable approaches are needed to support end-user devices that are either CPU-only or GPU-available, treating device capability as a new dimension of the bitrate ladder. To address this need, we propose DeepStream, a scalable content-aware per-title encoding approach that supports both CPU-only and GPU-available end users. (i) To support backward compatibility, DeepStream constructs a bitrate ladder based on any existing per-title encoding approach, so that the video content is delivered to legacy CPU-only end-user devices as a base layer (BL).
(ii) For high-end devices with GPU capabilities, an enhancement layer (EL) is added on top of the base layer. It comprises lightweight video super-resolution deep neural networks (DNNs) for each bitrate-resolution pair of the bitrate ladder. A content-aware video super-resolution approach yields higher video quality, but at the cost of bitrate overhead. To reduce the bitrate overhead of streaming content-aware video super-resolution DNNs, DeepCABAC (context-adaptive binary arithmetic coding for DNN compression) is used. Furthermore, the similarity among (i) segments within a scene and (ii) frames within a segment is exploited to reduce the training costs of the DNNs. Experimental results show bitrate savings of 34% and 36% at the same PSNR and VMAF, respectively, for GPU-available end users, while CPU-only users receive the video content as usual.
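The per-title encoding step mentioned above, which determines the convex hull, can be sketched as follows. A segment is first trial-encoded at several bitrate-resolution pairs. The efficient operating points are those on the upper convex hull of the resulting rate-quality curve. This minimal sketch assumes hypothetical `(bitrate_kbps, quality, resolution)` tuples and uses Andrew's monotone chain for the hull. It keeps the best resolution at each bitrate, drops points that fall below the hull, and truncates after the highest-quality point. It does not reproduce DeepStream's own ladder construction, which the abstract says can build on any existing per-title approach.

```python
from typing import Dict, Iterable, List, Tuple

Encode = Tuple[float, float, str]  # (bitrate_kbps, quality, resolution)


def _cross(o: Encode, a: Encode, b: Encode) -> float:
    """Z-component of (a - o) x (b - o) in the (bitrate, quality) plane."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_ladder(encodes: Iterable[Encode]) -> List[Encode]:
    """Upper convex hull of trial encodes, truncated at the quality peak."""
    # Keep only the best-quality encode at each bitrate.
    best: Dict[float, Encode] = {}
    for bitrate, quality, resolution in encodes:
        if bitrate not in best or quality > best[bitrate][1]:
            best[bitrate] = (bitrate, quality, resolution)

    # Andrew's monotone chain, upper hull only: scanning left to right, a point is
    # removed when it lies on or below the chord joining its neighbours.
    hull: List[Encode] = []
    for point in sorted(best.values()):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], point) >= 0:
            hull.pop()
        hull.append(point)
    if not hull:
        return []

    # Beyond the highest-quality point, extra bitrate buys nothing.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: peak + 1]


# Hypothetical trial encodes (quality on a 0-100 scale), not data from the paper.
trial_encodes = [
    (500, 60, "540p"), (500, 55, "1080p"),
    (1000, 62, "540p"),
    (1500, 75, "540p"), (1500, 80, "1080p"),
    (3000, 82, "540p"), (3000, 90, "1080p"),
    (6000, 95, "1080p"), (8000, 94, "1080p"),
]
ladder = convex_hull_ladder(trial_encodes)
for rung in ladder:
    print(rung)
```

In this toy data, the 1000 kbps encode is dropped because it lies below the chord between its neighbours. The 8000 kbps encode is dropped because it costs more bits than the 6000 kbps point for lower quality.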

  • Research Article
  • Cite Count Icon 7
  • 10.1145/3715144
UVC: A Unified Deep Video Compression Framework
  • Mar 10, 2025
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Lv Tang + 2 more

Recently, many works have applied deep learning techniques to video compression, achieving promising results and advancing the field of Deep Learning-Based Video Compression (DLVC). However, existing DLVC architectures are rigid and inflexible: different networks must be designed for different scenarios, such as delay-constrained and non-delay-constrained scenarios. Frequently switching between networks slows down modern deep learning platforms and increases maintenance costs. To address this problem, we propose a Unified Video Compression (UVC) framework that can be switched freely between application scenarios without changing the network architecture. UVC is built on an explicit-compression and implicit-generation perspective and contains two sub-networks: the Explicit Reference Frame Compression Network (ERFCN) and the Implicit Reference Frame Generation Network (IRFGN). ERFCN compresses the current frame with the help of the reference frame. To improve its performance, we first introduce a Transformer into this network, which can fully remove the spatial redundancy of the input image and benefits the subsequent inter-prediction process. We also develop a novel long-range motion estimation module for inter-prediction that generates motion vectors from global motion information between two frames, allowing it to handle long-range, complex motion. IRFGN captures the temporal relationship between the forward and backward reconstructed frames and synthesizes a high-quality implicit reference frame for the current frame. To achieve this, we design the split spatial-temporal attention and multi-scale prediction module. We conduct extensive experiments on three widely used video compression datasets (HEVC, UVG, and MCL-JCV), and the results demonstrate the superiority of our approach over other related DLVC methods.
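The abstract contrasts delay-constrained and non-delay-constrained scenarios, and IRFGN draws on both forward and backward reconstructed frames. As a generic illustration only (not UVC's own scheduler), the sketch below shows why the two scenarios differ: in low-delay coding each frame can reference only already-displayed frames, while in random-access coding frames are reordered so that a frame can reference one past and one future decoded frame. The function names, the intra-coded GOP end frames, and the bisection order are all assumptions made for this example.

```python
from collections import deque
from typing import List, Optional, Tuple

# (past reference, future reference); None means "no reference in that direction"
Refs = Tuple[Optional[int], Optional[int]]
Plan = List[Tuple[int, Refs]]


def low_delay_plan(gop_size: int) -> Plan:
    """Hypothetical delay-constrained plan: frames are coded in display
    order and each frame references only the previously decoded frame."""
    plan: Plan = [(0, (None, None))]  # frame 0 assumed intra-coded
    for t in range(1, gop_size + 1):
        plan.append((t, (t - 1, None)))
    return plan


def random_access_plan(gop_size: int) -> Plan:
    """Hypothetical non-delay-constrained plan: both GOP end frames are coded
    first (assumed intra), then the frames in between are coded by
    hierarchical bisection, each referencing the nearest decoded frame on
    each side."""
    plan: Plan = [(0, (None, None)), (gop_size, (None, None))]
    pending = deque([(0, gop_size)])
    while pending:
        left, right = pending.popleft()
        if right - left < 2:
            continue  # no frame strictly between left and right
        mid = (left + right) // 2
        plan.append((mid, (left, right)))  # bidirectional: past and future refs
        pending.append((left, mid))
        pending.append((mid, right))
    return plan


def is_causal(plan: Plan) -> bool:
    """Check that every reference is decoded before the frame that uses it."""
    decoded = set()
    for frame, refs in plan:
        if any(r is not None and r not in decoded for r in refs):
            return False
        decoded.add(frame)
    return True


if __name__ == "__main__":
    for name, plan in (("low-delay", low_delay_plan(4)),
                       ("random-access", random_access_plan(4))):
        print(name, [frame for frame, _ in plan], is_causal(plan))
```

For a GOP of 4, the low-delay plan codes frames in the order 0, 1, 2, 3, 4, while the random-access plan codes 0, 4, 2, 1, 3; both pass the causality check. A single network that accepts an optional future reference could, in principle, serve both plans, which is the kind of flexibility the abstract describes.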

  • Research Article
  • 10.31449/inf.v49i10.7146
Temporal Transformer-Based Video Super-Resolution Reconstruction with Cross-Modal Attention
  • Jan 28, 2025
  • Informatica
  • Jingmin Gong + 1 more

With the increasing demand for high-definition video, video super-resolution has become a key means of improving video picture quality. Traditional video super-resolution methods are limited by computational resources and model complexity and struggle to meet the demands of modern video processing. In recent years, deep learning has brought major breakthroughs to video super-resolution. In this paper, we propose a deep learning-based video super-resolution reconstruction method that combines a Transformer, cross-modal learning and fusion, and an attention mechanism. We design the Temporal Transformer-based Video Super-Resolution (TT-VSR) architecture, which significantly improves the accuracy and detail richness of video reconstruction by integrating the Transformer's self-attention mechanism with the spatial feature extraction capabilities of CNNs. Cross-modal learning and fusion, together with a cross-modal attention mechanism, further enhance the model's adaptability to complex scenes and its ability to recover detail. Experimental results demonstrate that our model outperforms existing methods, achieving a PSNR of X dB and an SSIM of Y, indicating substantial improvements in image quality. These results validate the efficacy of our approach and open a new path for the development of video super-resolution technology.
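The abstract reports reconstruction quality as PSNR and SSIM. For reference, PSNR is defined as 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value (255 for 8-bit frames) and MSE is the mean squared error between the reference and reconstructed frames. The minimal sketch below computes it over flattened pixel values; the function name and the 8-bit default are assumptions for illustration, and SSIM is not covered.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """PSNR in dB between two equally sized, flattened frames.

    max_value defaults to 255, assuming 8-bit pixels.
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Every pixel off by 1 gives MSE = 1, so PSNR = 10 * log10(255**2), about 48.13 dB
    print(round(psnr([0] * 16, [1] * 16), 2))
```

Because PSNR is logarithmic in MSE, halving the mean squared error raises PSNR by about 3 dB, which is why gains of a fraction of a decibel are often reported as meaningful in super-resolution and compression papers.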


Copyright 2026 Cactus Communications. All rights reserved.