Articles published on Video recognition
938 Search results
- New
- Research Article
- 10.3390/bdcc10030073
- Feb 28, 2026
- Big Data and Cognitive Computing
- Yujie Liu + 3 more
Large-scale pretrained foundation models are increasingly essential for affective analysis in user-generated videos. However, current approaches typically reuse generic multi-modal representations directly with task-specific adapters learned from scratch, and their performance is limited by the large affective domain gap and scarce emotion annotations. To address these issues, we introduce a novel paradigm that leverages auxiliary cross-modal priors to enhance unimodal emotion modeling, effectively exploiting modality-shared semantics and modality-specific inductive biases. Specifically, we propose a progressive prototype evolution framework that gradually transforms a neutral prototype into discriminative emotional representations through fine-grained cross-modal interactions with visual cues. The auxiliary prior serves as a structural constraint, reframing the adaptation challenge from a difficult domain shift problem into a more tractable prototype shift within the affective space. To ensure robust prototype construction and guided evolution, we further design category-aggregated prompting and bidirectional supervision mechanisms. Extensive experiments on VideoEmotion-8, Ekman-6, and MusicVideo-6 validate the superiority of our approach, achieving state-of-the-art results and demonstrating the effectiveness of leveraging auxiliary modality priors for foundation-model-based emotion recognition.
- New
- Research Article
- 10.1007/s40998-026-01029-y
- Feb 12, 2026
- Iranian Journal of Science and Technology, Transactions of Electrical Engineering
- M Shanmughapriya + 1 more
Enhanced Action Recognition in Videos Using Deep Hybrid Architecture with Modified SegNet and Improved Feature Set
- Research Article
- 10.1038/s41598-026-38947-0
- Feb 5, 2026
- Scientific reports
- Shenzhen Ding + 5 more
Reconstruction strategy of vehicle trajectory data for video recognition based on a two-step method of interpolation filtering.
- Research Article
- 10.3390/app16031289
- Jan 27, 2026
- Applied Sciences
- Mirela-Magdalena Grosu (Marinescu) + 3 more
Emotion recognition in video (ERV) aims to infer human affect from visual, audio, and contextual signals and is increasingly important for interactive and intelligent systems. Over the past decade, ERV has evolved from handcrafted features and task-specific deep learning models toward transformer-based vision–language models and multimodal large language models (MLLMs). This review surveys this evolution, with an emphasis on engineering considerations relevant to real-world deployment. We analyze multimodal fusion strategies, dataset characteristics, and evaluation protocols, highlighting limitations in robustness, bias, and annotation quality under unconstrained conditions. Emerging MLLM-based approaches are examined in terms of performance, reasoning capability, computational cost, and interaction potential. By comparing task-specific models with foundation model approaches, we clarify their respective strengths for resource-constrained versus context-aware applications. Finally, we outline practical research directions toward building robust, efficient, and deployable ERV systems for applied scenarios such as assistive technologies and human–AI interaction.
- Research Article
- 10.30935/ojcmt/17737
- Jan 14, 2026
- Online Journal of Communication and Media Technologies
- Alberto Sanchez-Acedo + 5 more
Generative artificial intelligence (Gen-AI) tools have a significant impact on the creation of audiovisual content. Although these tools are still at an early stage in video production, tools such as Sora (OpenAI) demonstrate the great potential of Gen-AI to create advanced audiovisual content. This study uses a comparative analysis to evaluate the realism, attractiveness, and composition of videos generated by Sora compared with real videos. Using a questionnaire validated by experts (n = 12), a quasi-experiment was conducted with college students (n = 62) who were divided into two groups: a control group that viewed real videos from YouTube and an experimental group that viewed videos created with the Sora tool. The results show that attractiveness, particularly the elements of lighting, saturation, and color, is a key factor in the recognition of a Gen-AI video. The paper concludes that Gen-AI tools should focus on improving these attractiveness elements to achieve more consistent and natural results.
- Research Article
- 10.1016/j.neucom.2025.131150
- Jan 1, 2026
- Neurocomputing
- Xiaolin Zhu + 7 more
Deep learning-based group activity recognition in videos: A survey
- Research Article
- 10.38094/jastt605659
- Dec 31, 2025
- Journal of Applied Science and Technology Trends
- Gnana Rubini R + 5 more
Earth and environmental monitoring is crucial for identifying changes in climatic conditions, ecosystem destruction, and natural calamities. The increasing availability of high-resolution satellite, aerial, and UAV imagery calls for sophisticated intelligent visual analytics that can derive actionable information from massive streams of remote-sensed data. Current image and video recognition methods cannot always achieve reliable performance in the presence of multimodal data heterogeneity, environmental dynamics, and noise interference in remote-sensing images. These issues limit the precision and flexibility of traditional deep learning-based monitoring systems in real-life applications. This paper proposes the Enhanced Visual Intelligence for Adaptive Recognition Network (EVIAR-Net), a hybrid deep learning model that combines Graph-Convolutional Vision Transformers (GCVT) with Adaptive Multi-Source Fusion (AMSF). EVIAR-Net captures spatial correlations along with temporal dependencies using graph-based spatial reasoning and transformer-based temporal encoding. AMSF adaptively combines multispectral, hyperspectral, and video modalities to provide robustness to illumination, motion, and atmospheric perturbations. Evaluations on several Earth observation datasets show a 21% improvement in recognition accuracy, a 30% improvement in inference speed, and better generalisation to unseen environments than CNN-, ViT-, and LSTM-based models. EVIAR-Net thus offers an intelligent, adaptable, and energy-efficient approach to next-generation environmental monitoring and predictive analytics.
- Research Article
- 10.34139/jscs.2025.15.4.67
- Dec 31, 2025
- Society for Standards Certification and Safety
- Jungyung Kim + 1 more
Recent deep learning-based video recognition technologies, driven by advancements in deep neural networks such as 3D CNNs, have achieved superhuman accuracy. However, the increasing scale of these models has led to massive computational costs and power consumption. Furthermore, the "black-box" nature of their complex inference processes limits their application in high-reliability fields like autonomous driving and healthcare. To overcome these limitations, this study proposes a novel video recognition model that secures both intelligent efficiency and explainability by applying the biological brain's principles of 'Selective Attention' and 'Functional Specialization' to deep neural network design. We first replicated the study by Hiramoto & Cline (2024) and conducted an in-depth analysis of neural data from the optic tectum using unsupervised learning techniques. This statistically verified that neurons differentiate into 'expert groups' that respond only to specific spatiotemporal patterns, such as static backgrounds, horizontal movements, or complex rotations. To implement these biomimetic principles, we designed the Spatially Adaptive MovieNet, which actively selects the optimal computational path by analyzing the dynamic complexity of input videos in real-time. The core Intelligent Gating Module detects high-information regions within the video and employs a Winner-Takes-All mechanism to physically execute only one computational path—either 2D (static) or 3D (dynamic)—thereby realizing substantial acceleration instead of using a probability-based weighted sum.
Furthermore, a multi-objective learning strategy including Sparsity Loss was established to induce the model to focus on motion, the key feature of the data, by enforcing sparsity in attention maps. Comparative experiments with a standard 3D CNN demonstrated that the proposed model achieved the same top classification accuracy of 98.84% while reducing the number of parameters by approximately 98% (from 7.35M to 0.14M) and computational cost (FLOPs) from 0.27G to 0.22G. Notably, the proposed model demonstrated adaptive capability by self-selecting lighter computational modes for simple data, and it clearly presented the rationale for its decisions by accurately visualizing the motion trajectories of the subject through sparsity loss. This study presents a significant direction for future lightweight AI research for low-power edge devices by integrating neuroscientific insights into deep learning architectures to prove intelligent efficiency.
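The winner-takes-all routing described in this abstract can be illustrated with a minimal NumPy sketch; this is not the paper's implementation, and the motion-energy statistic and threshold below are illustrative assumptions:

```python
import numpy as np

def motion_energy(clip):
    """Mean absolute frame-to-frame difference of a (T, H, W) clip."""
    return np.abs(np.diff(clip.astype(float), axis=0)).mean()

def select_path(clip, threshold=10.0):
    """Winner-takes-all gate: execute exactly one computational path,
    the cheap 2-D (static) branch for low-motion clips or the 3-D
    (dynamic) branch otherwise; the threshold is an assumption."""
    return "3d" if motion_energy(clip) > threshold else "2d"
```

A learned gating module would replace the fixed threshold; the point of the sketch is that only one branch is executed per input, rather than a probability-weighted sum of both.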
- Research Article
- 10.3390/app16010265
- Dec 26, 2025
- Applied Sciences
- Shuai Zhang + 5 more
Traditional water hazard monitoring often relies on manual inspection and water level sensors, typically lacking in accuracy and real-time capabilities. However, the method of using video surveillance for monitoring water hazard characteristics can compensate for these shortcomings. Therefore, this study proposes a method to detect water hazards in mines using video recognition technology, combining temporal and spatial descriptors to enhance recognition accuracy. This study employs residual preprocessing technology to effectively eliminate complex underground static backgrounds, focusing solely on dynamic water flow features, thereby addressing the issue of the absence of water inrush samples. The method involves analyzing dynamic water flow pixels and applying an iterative denoising algorithm to successfully remove discrete noise points while preserving connected water flow areas. Experimental results show that this method achieves a detection accuracy of 90.68% for gushing water, significantly surpassing methods that rely solely on temporal or spatial descriptors. Moreover, this method not only focuses on the temporal characteristics of water flow but also addresses the challenge of detection difficulties due to the lack of historical gushing water samples. This research provides an effective technical solution and new insights for future water gushing monitoring in mines.
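The residual preprocessing and iterative denoising steps described above can be sketched roughly as follows; this is a simplified illustration rather than the authors' method, with a median background model and a 4-neighbour isolation test assumed for concreteness:

```python
import numpy as np

def motion_residual(frames, threshold=25):
    """Subtract a static background (per-pixel median over time) and
    keep only pixels whose residual exceeds the threshold, isolating
    dynamic water-flow pixels from the static mine background."""
    background = np.median(frames, axis=0)
    residual = np.abs(frames - background)
    return (residual > threshold).astype(np.uint8)

def remove_isolated_pixels(mask):
    """Iteratively drop foreground pixels with no 4-connected
    foreground neighbour, removing discrete noise points while
    preserving connected water-flow regions."""
    m = mask.copy()
    while True:
        padded = np.pad(m, 1)
        neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                      + padded[1:-1, :-2] + padded[1:-1, 2:])
        isolated = (m == 1) & (neighbours == 0)
        if not isolated.any():
            return m
        m[isolated] = 0
```

The two-stage design mirrors the abstract's pipeline: the temporal step suppresses the static background, and the spatial step cleans the resulting mask.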
- Research Article
- 10.37614/2949-1215.2025.16.3.003
- Dec 25, 2025
- Transaction Kola Science Centre
- Olga N Zuenko + 1 more
The article provides an analytical overview of subspace clustering methods for processing high-dimensional data characterized by a large number of features and feature values. The methods can handle missing and noisy data. Clustering is performed not in the full feature space but in its projections, without replacing the original set of features with their linear combinations. This reduces the dimensionality of the feature space under consideration while keeping the clustering results interpretable for the user. The main stages of the clustering process within the considered methods are highlighted and described in detail. Attention is paid to the use of additional user constraints to improve the accuracy of the resulting partitions. The analyzed methods are widely used in various data mining problems, such as image and video recognition, text processing, and genome research.
- Research Article
- 10.26689/jera.v9i6.13152
- Dec 16, 2025
- Journal of Electronic Research and Application
- Xiang Gao + 3 more
To address the problems construction site video management faces in recognizing cigarette butts, reflective vests, and other objects, such as small-target confusion, high-brightness false alarms, missed detections under occlusion, and poor adaptability to complex environments, this study proposes a recognition accuracy optimization algorithm based on multimodal fusion. The research constructs a dataset containing three modalities of data: visible light, infrared, and millimeter-wave. The Dust-GAN algorithm is adopted to remove dust from and enhance dusty images, and the SAA module is introduced into YOLOv8-s to improve the small-target recall rate. Meanwhile, three-modal feature fusion is achieved, and channel pruning and quantization-aware training are used to make the algorithm lightweight. The algorithm was deployed and operated on-site for 3 months, reducing the construction site safety accident rate by 65%, which provides a solution for safety management and control in smart construction sites under complex environments.
- Research Article
- 10.47772/ijriss.2025.91100289
- Dec 8, 2025
- International Journal of Research and Innovation in Social Science
- Lum Fu Yuan + 2 more
SMARTSTOCK is an intelligent warehouse management system developed to overcome long-standing challenges in stock visibility, manual inventory counting, and inefficient resource utilization. The system integrates artificial intelligence, computer vision, and image processing to enable real-time stock detection, customer presence tracking, and parking lot traffic monitoring. Designed to enhance operational transparency and decision-making efficiency, SMARTSTOCK automates key warehouse functions traditionally dependent on manual supervision. The architecture comprises two core modules: an AI-based Video Recognition Module and a Real-Time Visibility and Reporting Module. The AI-based Video Recognition Module incorporates three sub-modules—Stock Detection and Tracking, Client Presence Counting, and Car Park Counting—which employ YOLOv8 models for object detection and the SORT algorithm for object tracking and counting. Experimental evaluation demonstrated high reliability and accuracy, achieving 90.16% for stock tracking, 95% for client presence counting, and 100% for car park occupancy detection. The Real-Time Visibility and Reporting Module provides a unified dashboard for data visualization, live monitoring, and decision support, significantly reducing human error and out-of-stock occurrences. Despite its strong performance, SMARTSTOCK faces limitations related to hardware dependency and difficulty in detecting low-stock items under full occlusion. Future enhancements will focus on cloud-based implementation, model optimization, and integration with point-of-sale systems to achieve comprehensive inventory intelligence. Overall, SMARTSTOCK represents a robust, explainable, and scalable AI-driven framework that advances warehouse automation, improves resource utilization, and strengthens real-time decision-making within retail environments.
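The detection-and-counting idea above (YOLOv8 boxes associated across frames by SORT) can be approximated in miniature with greedy IoU association; real SORT adds a Kalman motion model and Hungarian assignment, and keeps unmatched tracks alive briefly, all omitted in this sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

class GreedyTracker:
    """Greedy IoU association: each detection is matched to the best
    overlapping live track; unmatched detections open new track IDs,
    so the number of IDs issued is the object count."""
    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}   # track id -> last box
        self.next_id = 0

    def update(self, detections):
        assigned, free = {}, dict(self.tracks)
        for det in detections:
            best_id, best_iou = None, self.iou_threshold
            for tid, box in free.items():
                score = iou(det, box)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:
                best_id, self.next_id = self.next_id, self.next_id + 1
            else:
                free.pop(best_id)
            assigned[best_id] = det
        self.tracks = assigned  # tracks unmatched this frame are dropped
        return assigned

    @property
    def total_count(self):
        return self.next_id
```

Feeding each frame's detector boxes through `update` yields stable IDs for slowly moving stock items, and `total_count` gives the cumulative count used for presence and occupancy statistics.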
- Research Article
- 10.1016/j.media.2025.103716
- Dec 1, 2025
- Medical image analysis
- Adrito Das + 32 more
PitVis-2023 challenge: Workflow recognition in videos of endoscopic pituitary surgery.
- Research Article
- 10.1109/tpami.2025.3600702
- Dec 1, 2025
- IEEE transactions on pattern analysis and machine intelligence
- Yiyuan Zhang + 2 more
This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose the UniRepLKNet architecture, which offers systematic architecture design principles specifically crafted for large-kernel ConvNets, emphasizing their unique ability to capture extensive spatial information without deep layer stacking. This results in a model that not only surpasses its predecessors with an ImageNet accuracy of 88.0%, an ADE20K mIoU of 55.6%, and a COCO box AP of 56.4% but also demonstrates impressive scalability and performance on various modalities such as time-series forecasting, audio, point cloud, and video recognition. These results indicate the universal modeling abilities of large-kernel ConvNets with faster inference speed compared with vision transformers. Our findings reveal that large-kernel ConvNets possess larger effective receptive fields and a higher shape bias, moving away from the texture bias typical of smaller-kernel CNNs.
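The large-kernel argument rests on the standard receptive-field recurrence for stride-1 convolutions, RF = 1 + Σ(k_i − 1); a quick generic calculation (not code from the paper) shows why one large kernel can replace a deep stack of small ones:

```python
def receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 convolutions:
    each k x k layer grows the field by k - 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Five stacked 3x3 layers see only an 11x11 window,
# while a single 31x31 kernel covers 31x31 in one layer;
# matching it with 3x3 layers takes a 15-layer stack.
deep_stack = receptive_field([3] * 5)
large_kernel = receptive_field([31])
```

This is why a few large kernels capture extensive spatial context without deep layer stacking, though the actual effective receptive field of a trained network is also shaped by weights and nonlinearities.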
- Research Article
- 10.1016/j.patcog.2025.111725
- Dec 1, 2025
- Pattern Recognition
- Dan Liu + 5 more
SAM-Net: Semantic-assisted multimodal network for action recognition in RGB-D videos
- Research Article
- 10.1186/s43067-025-00295-w
- Nov 28, 2025
- Journal of Electrical Systems and Information Technology
- Mayada Khairy + 2 more
The increasing popularity of multimedia applications, such as video classification, has underscored the need for efficient methods to manage and categorize vast video datasets. Video classification simplifies video categorization, enhancing searchability and retrieval by leveraging distinctive features extracted from textual, audio, and visual components. This paper introduces an automated video recognition system that classifies video content based on motion types (low, medium, and high) derived from visual component characteristics. The proposed system utilizes advanced artificial intelligence techniques with four feature extraction methods: (1) MFCC alone, (2) MFCC after applying DWT, (3) denoised MFCC, and (4) MFCC after applying denoised DWT, together with seven classification algorithms to optimize accuracy. A novel aspect of this study is the application of Mel Frequency Cepstral Coefficients (MFCC) to extract features from the video domain rather than their traditional use in audio processing, demonstrating the effectiveness of MFCC for video classification. Seven classification techniques, including K-Nearest Neighbors (KNN), Radial Basis Function Support Vector Machines (SVM-RBF), Parzen Window Method, Neighborhood Components Analysis (NCA), Multinomial Logistic Regression (ML), Linear Support Vector Machines (SVM Linear), and Decision Trees (DT), are evaluated to establish a robust classification framework. Experimental results indicate that this denoising-enhanced system significantly improves classification accuracy, providing a comprehensive framework for future applications in multimedia management and other fields.
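To make the video-domain MFCC idea concrete, one can build a 1-D motion-energy signal from frame differences and apply a DCT to its log, mirroring how MFCC applies a DCT to log filterbank energies. The sketch below is a simplification with no mel filterbank; the statistic and function names are illustrative, not the paper's:

```python
import numpy as np

def motion_signal(frames):
    """Build a 1-D signal from the video domain: per-frame motion
    energy, i.e. mean absolute difference of consecutive frames."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.mean(axis=(1, 2))

def cepstral_features(signal, n_coeffs=4):
    """Naive DCT-II of the log signal, mirroring the MFCC recipe
    (log energies followed by a DCT) without a mel filterbank."""
    s = np.log(np.asarray(signal, dtype=float) + 1e-8)
    n = len(s)
    k = np.arange(n_coeffs)[:, None]
    t = np.arange(n)[None, :]
    return (np.cos(np.pi * k * (2 * t + 1) / (2 * n)) * s).sum(axis=1)
```

The resulting low-order coefficients summarize how motion energy varies over the clip and could be fed to any of the seven classifiers the paper evaluates.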
- Research Article
- 10.1038/s41598-025-27031-8
- Nov 25, 2025
- Scientific reports
- Mohd Aquib Ansari + 6 more
Surveillance systems play a crucial role in detecting suspicious human activities, including attacks, violence, and abductions, in public spaces. This study presents a human intervention-free, hybrid framework that utilizes deep neural networks for real-time theft activity recognition. The proposed methodology employs a dual-stream fusion network, combining appearance and motion features, to accurately identify theft actions. Specifically, a modified InceptionV3 model extracts relevant body pose features through keypoint transfer, feeding two separate deep neural network pipelines for appearance and motion analysis. A Long Short-Term Memory (LSTM) network then models temporal relationships between the extracted features across consecutive frames. The novelty of this research lies in the proposed dual-stream fusion architecture, which aims to capture fine-grained temporal and spatial cues for theft detection. A new lab-lifting dataset has also been developed to reflect subtle theft behaviors in academic settings. The framework's performance is evaluated on a dataset comprising normal and theft activities. The results demonstrate a recognition accuracy of 91.86%, surpassing that of other methods.
- Research Article
- 10.3390/electronics14234589
- Nov 23, 2025
- Electronics
- Nada Alzahrani + 2 more
Isolated Sign Language Recognition (ISLR), which focuses on identifying individual signs from sign language videos, presents substantial challenges due to small and ambiguous hand regions, high visual similarity among signs, and large intra-class variability. This study investigates the adaptability of YOLO-Act, a unified spatiotemporal detection framework originally developed for generic action recognition in videos, when applied to large-scale sign language benchmarks. YOLO-Act jointly performs signer localization (identifying the person signing within a video) and action classification (determining which sign is performed) directly from RGB sequences, eliminating the need for pose estimation or handcrafted temporal cues. We evaluate the model on the WLASL2000 and MSASL1000 datasets for American Sign Language recognition, achieving Top-1 accuracies of 67.07% and 81.41%, respectively. The latter represents a 3.55% absolute improvement over the best-performing baseline without pose supervision. These results demonstrate the strong cross-domain generalization and robustness of YOLO-Act in complex multi-class recognition scenarios.
- Research Article
- 10.1007/s00464-025-12341-9
- Nov 3, 2025
- Surgical endoscopy
- Jae Hyun Kwon + 5 more
To optimize surgical procedures and prevent retained surgical instruments, precise identification of required instruments during surgical treatment is essential. However, establishing ground truth data can be a labor-intensive barrier for researchers. Therefore, we developed and evaluated a novel system for detecting laparoscopic surgical instruments during laparoscopic cholecystectomy through virtual image creation. Virtual images were created by synthesizing laparoscopic instrument photos with surgical video backgrounds. The 311 instrument images and 1610 background images from 52 patients were augmented through random brightness, contrast, crop, rotation, scaling, flipping, and perspective transformations, resulting in 6023 composite images. These data were split into training, tuning, and internal test sets. Based on synthetic data, we developed a system comprising two-step processes. The first model is a unified instrument localization model that detects surgical instruments, and the second model is an instrument-type classification model that categorizes the detected surgical instruments. External and public datasets were used to evaluate generalizability. The unified instrument localization model achieved average precision (AP) values with intersection over union (IoU) of 0.5 of 0.981, 0.882, and 0.689 for internal, external, and public datasets, respectively. The instrument-type classification model demonstrated area under the curve (AUC) values of 0.959 for seven instrument types in the external dataset and 0.749 for four instrument types in the public dataset. The final two-step instrument detection model demonstrated an AUC of 0.848 for the external dataset and 0.688 for the public dataset, which showed significantly superior performance compared to conventional multi-class instrument models. This validated deep learning model using synthetically generated data provides a reliable framework for surgical instrument detection. 
Our approach demonstrates strong performance and generalizability, suggesting its potential for improving operative workflow efficiency and surgical education across various minimally invasive procedures.
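The augmentation step described above (random brightness, contrast, flipping, and similar transforms applied to composite instrument images) can be sketched minimally in NumPy; this is a stand-in for the study's seven-transform pipeline, and the jitter ranges are assumptions:

```python
import numpy as np

def augment(image, rng, brightness=0.2, contrast=0.2):
    """Random contrast and brightness jitter plus a random horizontal
    flip for a synthetic composite image; jitter ranges here are
    illustrative, not the paper's values."""
    img = image.astype(float)
    img = img * (1.0 + rng.uniform(-contrast, contrast))      # contrast
    img = img + 255.0 * rng.uniform(-brightness, brightness)  # brightness
    if rng.random() < 0.5:                                    # flip
        img = img[:, ::-1]
    return np.clip(img, 0, 255).astype(np.uint8)
```

Applying such jitter to each instrument-on-background composite is what lets a detector trained purely on synthetic data tolerate the lighting and viewpoint variation of real laparoscopic video.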
- Research Article
- 10.1007/s11633-025-1555-3
- Oct 21, 2025
- Machine Intelligence Research
- Xiao Wang + 7 more
Unleashing the Power of CNN and Transformer for Balanced RGB-event Video Recognition