Articles published on Learning-Based Video
129 Search results
- Research Article
- 10.3791/69299
- Mar 17, 2026
- Journal of visualized experiments : JoVE
- Miaomiao Feng + 1 more
This study aims to assess students' learning engagement in university classrooms using deep learning-based video object detection. To do so, via correlation analysis, this research first identified seven classroom behaviors presenting highly positive correlation with learning engagement as indicators to measure students' learning engagement; then it collected 30 synchronized videos of real classroom teaching from 6 classes from Shandong University of Science and Technology (SDUST) and divided them into a training set and a test set. After the seven behaviors were manually annotated in the training data, a machine learning algorithm was then trained in a supervised manner on this set. Once trained, the model generated initial annotations for the remaining unlabeled data. To achieve more accurate and efficient classroom behavior recognition, this study selected two representative algorithms, namely, Faster R-CNN and YOLOv5s, for behavior detection experiments. Based on a comparison of their detection performance in terms of accuracy and time cost, YOLOv5s was selected for classroom behavior detection in this study. Finally, this study used the focus group method to assign scores to each behavior and develop a three-level learning engagement scoring model. Based on automatically measured behavioral data, the model enables real-time, automatic assessment of learning engagement at both the individual and class levels.
- Research Article
- 10.1007/s00521-026-11949-9
- Mar 1, 2026
- Neural computing & applications
- Jian Sun + 1 more
Video quality significantly affects video classification. We encountered this problem when our classifier identified Mild Cognitive Impairment well from clear videos but worse from blurred ones. Since then, we have realized that drawing on Video Quality Assessment (VQA) may improve video classification. This paper proposes a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to achieve this goal. SSL-V3 leverages a Combined-SSL mechanism to join VQA into video classification and address the label shortage of VQA, which commonly occurs in video datasets, making it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL takes the video quality score as a factor to directly tune the feature map of the video classification. Then, the score, as an intersection point, links VQA and classification, using the supervised classification task to tune the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in the I-CONECT (a facial video-involved healthcare dataset), verifying SSL-V3's effectiveness.
- Research Article
- 10.3390/s26010321
- Jan 4, 2026
- Sensors (Basel, Switzerland)
- Yanwen Zhang + 2 more
In image sensing, measurements such as an object’s position or contour are typically obtained by analyzing digitized images. This method is widely used due to its simplicity. However, relative motion or inaccurate focus can cause motion and defocus blur, reducing measurement accuracy. Thus, video deblurring is essential. However, existing deep learning-based video deblurring methods struggle to balance high-quality deblurring, fast inference, and wide applicability. First, we propose a Current-Aware Temporal Fusion (CATF) framework, which focuses on the current frame in terms of both network architecture and modules. This reduces interference from unrelated features of neighboring frames and fully exploits current frame information, improving deblurring quality. Second, we introduce a Mixture-of-Experts module based on NAFBlocks (MoNAF), which adaptively selects expert structures according to the input features, reducing inference time. Third, we design a training strategy to support both sequential and temporally parallel inference. In sequential deblurring, we conduct experiments on the DVD, GoPro, and BSD datasets. Qualitative results show that our method effectively preserves image structures and fine details. Quantitative results further demonstrate that our method achieves clear advantages in terms of PSNR and SSIM. In particular, under the exposure setting of 3 ms–24 ms on the BSD dataset, our method achieves 33.09 dB PSNR and 0.9453 SSIM, indicating its effectiveness even in severely blurred scenarios. Meanwhile, our method achieves a good balance between deblurring quality and runtime efficiency. Moreover, the framework exhibits minimal error accumulation and performs effectively in temporal parallel computation. These results demonstrate that effective video deblurring serves as an important supporting technology for accurate image sensing.
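The PSNR figure quoted in this abstract follows the standard definition PSNR = 10 · log10(MAX² / MSE). A minimal sketch of that computation is given here, assuming 8-bit frames stored as nested lists; the function name `psnr` and the `max_value` default are illustrative assumptions, and the paper's own evaluation protocol (color space, cropping, averaging over frames) is not stated in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized frames.

    Frames are nested lists of pixel intensities (assumption: 8-bit range,
    so max_value defaults to 255).
    """
    ref_flat = [p for row in reference for p in row]
    out_flat = [p for row in restored for p in row]
    if not ref_flat or len(ref_flat) != len(out_flat):
        raise ValueError("frames must be non-empty and the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, out_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.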
- Research Article
- 10.1016/j.aei.2025.103903
- Jan 1, 2026
- Advanced Engineering Informatics
- Junying Wang + 4 more
Deep internal learning-based video compressive sensing for the identification of high-frequency structural dynamic characteristics using full-field vision methods
- Research Article
- 10.1109/access.2026.3670354
- Jan 1, 2026
- IEEE Access
- Sota Moriyama + 2 more
This paper proposes a data augmentation method that simulates artifacts specific to real-world videos as a preprocessing step for applying a deep learning-based video deblurring method to real-world videos. Conventional methods in video deblurring using deep learning have suffered from poor generalization performance. Even if a video deblurring method shows high accuracy on a test dataset in the same domain as the training dataset, it will show less accuracy when inferring real-world test videos. Therefore, we assume that real-world videos contain compression noise and image processing artifacts not included in training deblurring datasets. We introduce a data augmentation method that applies data transformations simulating these real-world video-specific degradations during training. In this study, we prepare a real-world test dataset with no ground truth videos using video captured by a commercially available smartphone. Then, we aim to improve the estimation accuracy of deblurring in real-world videos by performing inference using our data augmentation method.
- Research Article
- 10.34139/jscs.2025.15.4.67
- Dec 31, 2025
- Society for Standards Certification and Safety
- Jungyung Kim + 1 more
Recent deep learning-based video recognition technologies, driven by advancements in deep neural networks such as 3D CNNs, have achieved superhuman accuracy. However, the increasing scale of these models has led to massive computational costs and power consumption. Furthermore, the "black-box" nature of their complex inference processes limits their application in high-reliability fields like autonomous driving and healthcare. To overcome these limitations, this study proposes a novel video recognition model that secures both intelligent efficiency and explainability by applying the biological brain's principles of 'Selective Attention' and 'Functional Specialization' to deep neural network design. We first replicated the study by Hiramoto & Cline (2024) and conducted an in-depth analysis of neural data from the optic tectum using unsupervised learning techniques. This statistically verified that neurons differentiate into 'expert groups' that respond only to specific spatiotemporal patterns, such as static backgrounds, horizontal movements, or complex rotations. To implement these biomimetic principles, we designed the Spatially Adaptive MovieNet, which actively selects the optimal computational path by analyzing the dynamic complexity of input videos in real-time. The core Intelligent Gating Module detects high-information regions within the video and employs a Winner-Takes-All mechanism to physically execute only one computational path, either 2D (static) or 3D (dynamic), thereby realizing substantial acceleration instead of using a probability-based weighted sum. Furthermore, a multi-objective learning strategy including Sparsity Loss was established to induce the model to focus on motion, the key feature of the data, by enforcing sparsity in attention maps. Comparative experiments with a Standard 3D CNN demonstrated that the proposed model achieved the same top classification accuracy of 98.84% while reducing the number of parameters by approximately 98% (from 7.35M to 0.14M) and computational cost (FLOPs) from 0.27G to 0.22G. Notably, the proposed model demonstrated adaptive capability by self-selecting lighter computational modes for simple data and clearly presented the rationale for its decisions by accurately visualizing the motion trajectories of the subject through sparsity loss. This study presents a significant direction for future lightweight AI research for low-power edge devices by integrating neuroscientific insights into deep learning architectures to prove Intelligent Efficiency.
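The contrast this abstract draws between a Winner-Takes-All gate and a probability-weighted sum can be illustrated with a small sketch. Everything named here (`soft_mixture`, `winner_takes_all`, the stand-in `path_2d` and `path_3d` functions, the `calls` counter) is hypothetical and is not the authors' Intelligent Gating Module; the point is only that hard selection executes a single path, whereas a weighted sum must execute all of them.

```python
import math

# Hypothetical stand-ins for a cheap 2D path and a costly 3D path.
calls = {"2d": 0, "3d": 0}


def path_2d(x):
    calls["2d"] += 1
    return x * 1.0


def path_3d(x):
    calls["3d"] += 1
    return x * 2.0


def soft_mixture(gate_scores, paths, x):
    """Softmax-weighted sum: every path runs, so no compute is saved."""
    peak = max(gate_scores)
    exps = [math.exp(s - peak) for s in gate_scores]
    total = sum(exps)
    return sum((e / total) * path(x) for e, path in zip(exps, paths))


def winner_takes_all(gate_scores, paths, x):
    """Hard gating: only the highest-scoring path is executed."""
    winner = max(range(len(paths)), key=lambda i: gate_scores[i])
    return winner, paths[winner](x)
```

Calling `winner_takes_all([0.2, 1.5], [path_2d, path_3d], 3.0)` increments only the `"3d"` counter, while `soft_mixture` with the same arguments increments both.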
- Research Article
3
- 10.1007/s10791-025-09834-5
- Dec 8, 2025
- Discover Computing
- Gang Guo + 7 more
Conventional safety monitoring methods are increasingly inadequate for the complex conditions of modern coal mines. This study introduces a safety monitoring framework based on an enhanced YOLO model, specifically adapted for underground environments. Improvements include optimized anchor box design with K-means clustering, which reduces detection errors and improves localization accuracy. Evaluations on benchmark datasets demonstrate superior results, with mAP scores of 0.82 on UCF101, 0.85 on MS COCO, and 0.80 on a coal mine video dataset. When integrated with ConvLSTM, the system achieves higher accuracy in miner behavior recognition, while the incorporation of sensor data enables precise prediction of gas concentration, temperature, and humidity. Additionally, the decision-making module provides reliable early warnings of hazards such as gas leaks, fire, and unsafe behaviors, achieving the highest detection accuracy and an average response time of only 3 s. The proposed system enhances detection performance, robustness, and real-time responsiveness, offering strong support for coal mine safety management.
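The abstract mentions anchor box design via K-means clustering without giving details. A common way to do this, popularized with YOLOv2, is to cluster ground-truth box widths and heights using 1 − IoU as the distance. The sketch below follows that convention as an assumption; the function names, the deterministic initialization, and the iteration cap are illustrative choices rather than the paper's procedure.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes assumed to share the same corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (width, height) pairs into k anchors; highest IoU = nearest."""
    if len(boxes) < k:
        raise ValueError("need at least k boxes")

    # Deterministic initialization: evenly spaced boxes after sorting by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    anchors = [ordered[int(i * step + step / 2)] for i in range(k)]

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[nearest].append(box)

        # Move each anchor to the mean width/height of its members.
        updated = []
        for anchor, members in zip(anchors, clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(anchor)  # keep an empty cluster's anchor

        if updated == anchors:
            break  # assignments have stabilized
        anchors = updated

    return sorted(anchors, key=lambda a: a[0] * a[1])
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is the usual motivation for this variant.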
- Research Article
2
- 10.1007/s10845-025-02720-3
- Nov 11, 2025
- Journal of Intelligent Manufacturing
- Ruiyuan Zhang + 7 more
Laser Powder Bed Fusion is among the most widely used techniques for metal additive manufacturing. In this process, a laser melts metal powder onto a substrate, forming a melt pool. The solid-liquid interface of the melt pool plays a critical role in the cooling behavior, which in turn affects the microstructure and mechanical properties of the printed part. High-speed X-ray imaging enables real-time observation of subsurface melt pool dynamics. However, accurately segmenting the melt pool from X-ray images remains challenging due to high noise levels and low contrast. Efficient data processing methods for this task are still underdeveloped. Researchers often rely on manual image masking or basic image processing techniques for object segmentation, which are either labor-intensive or lack sufficient accuracy and robustness. This study introduces a deep learning-based video object segmentation model that automatically tracks and segments the melt pool, thereby determining the solid-liquid interface in X-ray image sequences. The model is semi-supervised and highly efficient, requiring manual image masking only for the first frame to predict segmentations in subsequent frames. It incorporates spatiotemporal attention modules to capture correlations within the image sequence effectively. Specifically, a co-attention module extracts temporal features from the previous frame, while attention blocks highlight key regions in the current frame. Experimental results show that integrating attention mechanisms significantly improves segmentation accuracy compared to state-of-the-art methods.
- Research Article
1
- 10.1111/hel.70078
- Sep 1, 2025
- Helicobacter
- Li Yan-Dong + 8 more
Real-time assessment of Helicobacter pylori infection during esophagogastroduodenoscopy (EGD) is clinically valuable but remains technically challenging. We developed a deep learning-based system to predict H. pylori infection directly from EGD videos. This prospective multicenter diagnostic study enrolled patients undergoing EGD at three hospitals between September and December 2024. All patients underwent the 14C-urea breath test as the reference standard. The model integrated deep learning-based video analysis to predict gastric regions with H. pylori infection in real time. The primary outcomes were diagnostic accuracy, sensitivity, and specificity. Secondary outcomes included the positive predictive value, negative predictive value, and area under the receiver operating characteristic curve (AUC). Logistic regression was used to explore factors associated with diagnostic performance. Among the cohort of 701 patients, 42.4% were positive for H. pylori infection. The model achieved an AUC of 0.918 (95% CI: 0.895-0.937), with an accuracy of 86.3% (95% CI: 83.5%-88.8%), sensitivity of 86.9% (95% CI: 82.5%-90.5%), and specificity of 85.9% (95% CI: 82.1%-89.1%). By multivariate analysis, mucosal atrophy was independently associated with an increased diagnostic error (OR = 1.788, p = 0.014), while a higher examination quality score was protective (OR = 0.600, p < 0.001). This deep learning model demonstrated high diagnostic performance for real-time H. pylori detection during EGD across multiple centers and should be considered to improve diagnostic efficiency and consistency of clinical endoscopy. Chinese Clinical Trial Registry registration number: ChiCTR2400088612.
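The primary and secondary outcomes listed in this abstract all derive from a 2×2 confusion matrix. The sketch below shows those definitions. The counts in `EXAMPLE_COUNTS` are not reported in the abstract: they are an assumption, back-calculated so that a cohort of 701 patients reproduces the stated prevalence, sensitivity, specificity, and accuracy to within rounding.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Assumed counts (NOT from the source): chosen to be consistent with
# 701 patients, 42.4% prevalence, and the reported percentages.
EXAMPLE_COUNTS = {"tp": 258, "fn": 39, "tn": 347, "fp": 57}
```

Calling `diagnostic_metrics(**EXAMPLE_COUNTS)` gives values that round to the sensitivity, specificity, and accuracy quoted above; the PPV and NPV it returns depend on these assumed counts and should not be read as the study's reported secondary outcomes.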
- Research Article
1
- 10.1038/s41598-025-10397-0
- Jul 7, 2025
- Scientific Reports
- Soyoung Kwak + 5 more
The videofluoroscopic swallowing study (VFSS) is the gold standard for diagnosing dysphagia, but its interpretation is time-consuming and requires expertise. This study developed a deep learning model for automatically detecting penetration and aspiration in VFSS and assessed its diagnostic accuracy. Images corresponding to the highest and lowest positions of the hyoid bone —representing the moment of upper esophageal sphincter opening during swallow and the pre-swallow and post-swallow phases, respectively— were automatically extracted from VFSS videos, resulting in a total of 18,145 images from 1,467 patients. The model was trained with a convolutional neural network architecture, incorporating techniques to address class imbalance and optimize performance. The model achieved high diagnostic accuracy at the patient level, with the area under the receiver operating characteristic curve values of 0.935 (normal swallowing), 0.889 (penetration), and 0.845 (aspiration). However, despite strong performance in identifying normal swallowing, the model exhibited low sensitivity for detecting penetration and aspiration. The findings suggest that the proposed model may reduce interpretation time by minimizing the need for repeated video review to identify penetration or aspiration, enabling clinicians to focus on other clinically relevant VFSS findings. Future studies should address its limitations by analyzing full-frame VFSS data and incorporating multicenter datasets.
- Research Article
1
- 10.1007/s11548-025-03442-w
- Jun 26, 2025
- International journal of computer assisted radiology and surgery
- Rui Guo + 4 more
Artificial intelligence is transforming surgical practices by improving procedural quality and decision-making. Machine learning-based video analysis can reliably identify surgical milestones, enhancing contextual understanding for surgeons. This study proposes a novel framework for detecting critical view of safety (CVS) in robot-assisted laparoscopic cholecystectomy (RLC) to improve procedural safety. We present a meta-auxiliary learning framework that delicately combines milestone recognition and anatomical segmentation to enhance contextual awareness. The framework addresses label imbalance by facilitating knowledge sharing across tasks, ensuring balanced optimization. A curated RLC dataset was utilized to evaluate CVS identification and multi-instance segmentation performance. The proposed method achieved an F1 score of 78% for CVS detection and a mean IOU of 83.9% for anatomical segmentation, demonstrating its efficacy in complex surgical environments. This framework establishes a new paradigm for surgical video analysis by integrating milestone detection and segmentation. Its ability to enhance decision support and procedural review in RLC highlights its potential for broader adoption in clinical practice.
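For readers less familiar with the two scores reported here, the following sketch shows how an F1 score and a mask IoU are typically computed. The function names and the nested-list mask representation are assumptions for illustration; the abstract does not say how the authors average IoU across instances or classes.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in count form."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mask_iou(predicted, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (predicted, target) mask pairs."""
    scores = [mask_iou(p, t) for p, t in mask_pairs]
    return sum(scores) / len(scores)
```

F1 is sensitive to class imbalance in a way plain accuracy is not, which may be why it is the headline metric for a detection task where the positive class is rare.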
- Research Article
1
- 10.3390/biology14070771
- Jun 26, 2025
- Biology
- Roland Juhos + 3 more
The presence of aggressive behavior in livestock creates major difficulties for animal welfare, farm safety, economic performance and selective breeding. The two innovative tools of deep learning-based video analysis and transcriptomic profiling have recently appeared to aid the understanding and monitoring of such behaviors. This scoping review assesses the current use of these two methods for aggression research across livestock species and identifies trends while revealing unaddressed gaps in existing literature. A scoping literature search was performed through the PubMed, Scopus and Web of Science databases to identify articles from 2014 to April 2025. The research included 268 original studies which were divided into 250 AI-driven behavioral phenotyping papers and 18 transcriptomic investigations without any studies combining both approaches. Most research focused on economically significant species, including pigs and cattle, yet poultry and small ruminants, along with camels and fish and other species, received limited attention. The main developments include convolutional neural network (CNN)-based object detection and pose estimation systems, together with the transcriptomic identification of molecular pathways that link to aggression and stress. The main barriers to progress in the field include inconsistent behavioral annotation and insufficient real-farm validation together with limited cross-modal integration. Standardized behavior definitions, together with multimodal datasets and integrated pipelines that link phenotypic and molecular data, should be developed according to our proposal. These innovations will speed up the advancement of livestock welfare alongside precision breeding and sustainable animal production.
- Research Article
- 10.55041/isjem04127
- Jun 7, 2025
- International Scientific Journal of Engineering and Management
- Sakshi Mohite
This project aims to develop a deep learning-based video summarization system that utilizes Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to analyze video content and generate concise summaries. The system will automatically identify key objects, events, and scenes in videos, and create summaries that capture the essential information. The project will explore various deep learning architectures and techniques to improve the quality and efficiency of video summarization.
- Research Article
- 10.1016/j.jort.2025.100890
- Jun 1, 2025
- Journal of Outdoor Recreation and Tourism
- Hugo Moreno + 2 more
Deep learning-based video analysis for visitor detection and tracking in protected areas
- Research Article
- 10.47191/ijmra/v8-i05-67
- May 31, 2025
- INTERNATIONAL JOURNAL OF MULTIDISCIPLINARY RESEARCH AND ANALYSIS
- Adia Adia + 4 more
In the current learning process, innovative learning media are needed to improve students' understanding and learning outcomes. Civics subjects are often considered boring because they contain a lot of theoretical and abstract material. The purpose of this research is to determine whether the use of video media based on problem-based learning affects student learning outcomes. The research used observation, documentation, and tests. The findings show that students' low learning outcomes in Civics are thought to be caused by a lack of interesting learning methods and media. Using video media based on Problem Based Learning (PBL) in Civics learning to present real problems stimulates students' critical thinking and improves student learning outcomes.
- Research Article
- 10.1093/bjs/znaf092.022
- May 16, 2025
- British Journal of Surgery
- L Schewski + 5 more
Background: Accurate identification of intraoperative behaviors is crucial for assessing surgical performance, improving patient outcomes, and supporting surgical training. Traditional methods for evaluating intraoperative behaviors rely on experts' on-site observations or assessments of video recordings. Although these methods have been shown to be reliable, they are time-consuming, prone to bias, and limited in scalability. Video recordings of the operating room (OR), combined with methodological advancements in computer vision and machine learning, offer promising opportunities for automated, objective, and scalable behavior analysis. Aims: This study explores the feasibility of automated approaches for assessing teamwork-related intra-operative behaviors in the OR. In a stepwise approach, we aim to automatically: 1) detect the positions and poses of the OR team members, 2) analyze movements and distribution patterns of the OR team, 3) determine their roles and functions, and 4) recognize structured team communication (e.g. team timeouts, briefings). Methods: A multi-view OR dataset with over 100 hours of video recordings was created at a Swiss university hospital, featuring annotations of team interactions during real surgical procedures. Using deep learning-based video techniques, a multidisciplinary team of work psychologists, computer scientists, and surgeons detects and analyzes key events of interest. Results: A framework for automatic video analysis was developed and validated using the created dataset. The experimental results show that our framework provides a valuable and efficient alternative to existing state-of-the-art approaches for both surgical role classification and team communication detection tasks. Conclusion: We present a novel pipeline that automatically classifies the roles of the OR team members and detects behavioral team interactions. This work highlights the potential of automated approaches to revolutionize surgical practice and education by providing scalable, objective insights into non-technical skills.
- Research Article
- 10.53759/7669/jmc202505076
- Apr 5, 2025
- Journal of Machine and Computing
- Bairavel S + 5 more
Deep Learning (DL) is revolutionizing video processing as video becomes increasingly central to daily life. Encoding and transmitting video effectively becomes challenging as content resolution and data volume grow rapidly. This research presents an advanced method for Video Compression (VC) that uses DL to enhance encoding and transmission efficiency, demonstrating the need for more cutting-edge methods in digital media. This work uses advanced Machine Learning (ML) to reduce video data size without compromising video quality, enhancing its suitability for high-definition streaming and videoconferencing. The algorithm combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) to improve video quality: the CNN captures complex spatial details within each video frame, while a Long Short-Term Memory (LSTM) network relates frames across time. The proposed VC achieves high video quality compared to traditional methods like H.264 and H.265. It adapts in real time and optimizes video bandwidth usage, making it useful for live streaming services and video conferencing. The VC has been tested extensively, demonstrating significant bit rate reduction while maintaining excellent video quality. It surpasses modern compression methods, making it a flexible solution to the increasing demand for high-quality video content. This advance in VC is expected to change digital media distribution for good.
- Research Article
38
- 10.1109/tcsvt.2022.3229079
- Apr 1, 2025
- IEEE Transactions on Circuits and Systems for Video Technology
- Hadi Amirpour + 2 more
In HTTP Adaptive Streaming (HAS), each video is divided into smaller segments, and each segment is encoded at multiple pre-defined bitrates to construct a bitrate ladder. To optimize bitrate ladders, per-title encoding approaches encode each segment at various bitrates and resolutions to determine the convex hull. From the convex hull, an optimized bitrate ladder is constructed, resulting in an increased Quality of Experience (QoE) for end-users. With the ever-increasing efficiency of deep learning-based video enhancement approaches, they are more and more employed at the client-side to increase the QoE, specifically when GPU capabilities are available. Therefore, scalable approaches are needed to support end-user devices with both CPU and GPU capabilities (denoted as CPU-only and GPU-available end-users, respectively) as a new dimension of a bitrate ladder. To address this need, we propose DeepStream, a scalable content-aware per-title encoding approach to support both CPU-only and GPU-available end-users. (i) To support backward compatibility, DeepStream constructs a bitrate ladder based on any existing per-title encoding approach. Therefore, the video content will be provided for legacy end-user devices with CPU-only capabilities as a base layer (BL). (ii) For high-end end-user devices with GPU capabilities, an enhancement layer (EL) is added on top of the base layer comprising lightweight video super-resolution deep neural networks (DNNs) for each bitrate-resolution pair of the bitrate ladder. A content-aware video super-resolution approach leads to higher video quality, however, at the cost of bitrate overhead. To reduce the bitrate overhead for streaming content-aware video super-resolution DNNs, DeepCABAC, context-adaptive binary arithmetic coding for DNN compression, is used. Furthermore, the similarity among (i) segments within a scene and (ii) frames within a segment are used to reduce the training costs of DNNs. Experimental results show bitrate savings of 34% and 36% to maintain the same PSNR and VMAF, respectively, for GPU-available end-users, while the CPU-only users get the desired video content as usual.
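The convex-hull step that per-title encoding relies on can be sketched as follows. Given trial encodes of one segment at several bitrate-resolution pairs, the ladder keeps only the points on the upper convex hull of the rate-quality plane. This is a generic illustration, not DeepStream's implementation: the function names, the tuple layout, and the choice of a monotone-chain hull are assumptions, and the quality score may be PSNR, VMAF, or any other metric.

```python
def _cross(o, a, b):
    """Z-component of (a - o) x (b - o) in the (bitrate, quality) plane."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def build_bitrate_ladder(encodes):
    """Select ladder rungs from (bitrate_kbps, quality, resolution) tuples."""
    # At each bitrate, keep the resolution that gives the best quality.
    best = {}
    for bitrate, quality, resolution in encodes:
        if bitrate not in best or quality > best[bitrate][1]:
            best[bitrate] = (bitrate, quality, resolution)
    points = [best[b] for b in sorted(best)]

    # Upper convex hull, scanning left to right (monotone chain).
    hull = []
    for point in points:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], point) >= 0:
            hull.pop()  # the middle point lies on or below the chord
        hull.append(point)

    # Stop at the quality peak: more bitrate for no gain is never a rung.
    ladder = []
    for point in hull:
        if ladder and point[1] <= ladder[-1][1]:
            break
        ladder.append(point)
    return ladder
```

With hypothetical encodes such as `(1500, 37.0, "720p")` sitting below the chord between `(1000, 36.0, "720p")` and `(2000, 40.0, "1080p")`, the 1500 kbps point is dropped because a client is better served by one of its neighbors.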
- Research Article
7
- 10.1145/3715144
- Mar 10, 2025
- ACM Transactions on Multimedia Computing, Communications, and Applications
- Lv Tang + 2 more
Recently, many works have applied deep learning techniques to video compression tasks, achieving promising results and advancing the field of Deep Learning-Based Video Compression (DLVC). However, the architecture design of the existing DLVC is rigid and limited in terms of flexibility. Specifically, different networks must be designed for different scenarios, such as delay-constrained scenario or non-delay-constrained scenario. Frequent switching between networks would reduce the speed of modern deep learning platforms and increase the maintenance costs. To address this problem, we propose a Unified Video Compression (UVC) framework that can be freely switched to different application scenarios without changing the network architecture. Our proposed UVC framework is based on the explicit-compression and implicit-generation perspective, which contains two sub-networks—the Explicit Reference Frame Compression Network (ERFCN) and the Implicit Reference Frame Generation Network (IRFGN). The aim of ERFCN is to compress the current frame with the help of the reference frame. To improve the performance of ERFCN, we first introduce the Transformer in this network, which can fully remove the spatial redundancy of the input image and is beneficial for the following inter-prediction process. We also develop a novel long-range motion estimation module for inter-prediction to generate motion vectors based on global motion information between two frames, which can handle long-range complex motion relations. The aim of IRFGN is to capture the temporal relationship between forward and backward reconstructed frames and synthesize a high-quality implicit reference frame for the current frame. To achieve this, we design the split spatial-temporal attention and multi-scale prediction module. 
We conduct extensive experiments on three widely used video compression databases (HEVC, UVG, and MCL-JVC), and the results demonstrate the superiority of our approach over other related DLVC methods.
- Research Article
- 10.31449/inf.v49i10.7146
- Jan 28, 2025
- Informatica
- Jingmin Gong + 1 more
With the increasing demand for high-definition video, video super-resolution technology has become a key means to improve video picture quality. Traditional video super-resolution methods are limited by computational resources and model complexity, and they struggle to meet the demands of modern video processing. In recent years, the rise of deep learning technology has brought a revolutionary breakthrough for video super-resolution. In this paper, we propose a deep learning-based video super-resolution reconstruction method that combines Transformer, cross-modal learning and fusion, and an attention mechanism. We design the Temporal Transformer-based Video Super-Resolution (TT-VSR) architecture, which significantly improves the accuracy and detail richness of video reconstruction by integrating the Transformer's self-attention mechanism with CNN's spatial feature extraction capabilities. The introduction of cross-modal learning and fusion, along with the cross-modal attention mechanism, further enhances the model's adaptability to complex scenes and detail recovery ability. Experimental results demonstrate that our model outperforms existing methods, achieving a PSNR of X dB and an SSIM of Y, indicating substantial improvements in image quality. These results validate the efficacy of our approach and open a new path for the development of video super-resolution technology.