Articles published on Robust Recognition
2,847 search results, sorted by recency
- New
- Research Article
- 10.1016/j.wasman.2025.115245
- Jan 15, 2026
- Waste management (New York, N.Y.)
- Jun He + 4 more
Robust referring image segmentation for construction and demolition waste recognition
- New
- Research Article
- 10.1088/1361-6501/ae26ba
- Jan 7, 2026
- Measurement Science and Technology
- Zheng Tian + 2 more
Heterogeneous gear fault vibration signals are often influenced by the coupling of multi-level modulation sources, which severely hinders the separation of informative features and obstructs accurate tracking of their intrinsic temporal evolution. Moreover, current research in fault diagnosis lacks effective co-modeling strategies that can jointly characterize the multi-source coupling mechanisms and their dynamic progression, making it difficult to decouple fault-relevant patterns from interfering components in such modulated signals. To address this issue, we propose a physics-guided spatiotemporal decoupling framework, which integrates spatiotemporal modeling with the underlying physical modulation mechanisms. First, inspired by the kinematic characteristics of gear systems, a Heuristic Selective Decision Solver (HSDS) is designed to extract modulation patterns dominated by rotational periodicity, thereby achieving effective isolation of multi-source interference. Next, the extracted patterns are transformed into physically meaningful enhanced representations via a spatiotemporal encoder, which simultaneously preserves the local structural details and temporal evolution of fault features. Furthermore, a Spatiotemporal Coupling Capture Network (SCCN) is developed, incorporating a physically salient attention mechanism to adaptively emphasize critical fault-related components, significantly improving the discriminative capability of the learned features. Finally, a neural operator-based classifier is employed to accomplish robust fault recognition. Experimental results demonstrate that the proposed method achieves high diagnostic accuracy and robustness across a variety of complex operating conditions, enabling efficient and reliable fault diagnosis for gear transmission systems.
- New
- Research Article
- 10.1016/j.engappai.2025.113143
- Jan 1, 2026
- Engineering Applications of Artificial Intelligence
- Daoxiang Zhou + 4 more
Multi-scale orthogonal Gabor filters based ConvNets for illumination robust single sample face recognition
- New
- Research Article
- 10.1088/2631-8695/ae30cb
- Jan 1, 2026
- Engineering Research Express
- Chandrashekar M Patil + 1 more
Iris recognition is one of the most reliable biometric identification techniques due to the uniqueness and stability of iris patterns. Deep learning techniques have emerged as a powerful approach for developing more accurate and robust iris recognition systems, but the real-time implementation of deep architectures presents notable challenges, primarily due to substantial computational and memory demands. In this paper, we present a novel lightweight deep convolutional neural network architecture that effectively addresses the trade-off between classification accuracy and computational complexity. The pre-processing pipeline is aimed at accurately localizing and segmenting the iris image; it comprises the Circular Hough Transform (CHT) for precise iris localization, occlusion removal for handling eyelids and eyelashes, and Contrast-Limited Adaptive Histogram Equalization (CLAHE) for photometric enhancement. In the proposed deep architecture, depthwise convolutions efficiently extract spatial features from each input channel independently, significantly reducing computational cost, while pointwise convolutions enable channel-wise information fusion to learn discriminative and compact feature representations. The traditional softmax layer is replaced with an SVM classifier using a Radial Basis Function (RBF) kernel, which enhances non-linear decision boundary learning and generalization capability through the max-margin principle. The proposed model outperforms state-of-the-art pretrained models with a recognition accuracy of 99.3% and an equal error rate (EER) of 0.3% on a multi-source benchmark iris dataset (CASIA and MMU1) and demonstrates strong cross-sensor interoperability. The proposed framework offers a promising solution for real-time iris recognition in applications with limited computational resources.
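For readers who want the flavor of the two ingredients named above, here is a minimal PyTorch/scikit-learn sketch, not the authors' code: a depthwise-separable convolution block, and an RBF-kernel SVM fitted on the CNN's embeddings in place of a softmax layer. All layer sizes, input shapes, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch (not the authors' released model): depthwise + pointwise
# convolutions, with an RBF-SVM head replacing softmax. Sizes are assumptions.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch) -> cheap spatial features
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        # Pointwise: 1x1 conv fuses information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

backbone = nn.Sequential(
    DepthwiseSeparableBlock(1, 32),
    nn.MaxPool2d(2),
    DepthwiseSeparableBlock(32, 64),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# After training the CNN, fit the RBF-SVM on its embeddings.
with torch.no_grad():
    feats = backbone(torch.randn(16, 1, 64, 256)).numpy()  # dummy normalized iris strips
labels = [i % 4 for i in range(16)]                        # dummy identities
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(feats, labels)
```

In practice the backbone would first be trained end-to-end (for example with a temporary softmax head) before its embeddings are handed to the SVM.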
- New
- Research Article
- 10.1016/j.engappai.2025.113164
- Jan 1, 2026
- Engineering Applications of Artificial Intelligence
- Zige Luo + 5 more
A day-night cross-modal network for robust commodity recognition under low-light illumination
- New
- Research Article
- 10.1016/j.compind.2025.104411
- Jan 1, 2026
- Computers in Industry
- Bo Zhu + 4 more
Robust non-contact material recognition for robots in extreme and dynamic environments
- New
- Research Article
- 10.3390/sym18010071
- Dec 31, 2025
- Symmetry
- Yifan Hu + 3 more
Source camera identification relies on sensor noise features to distinguish between different devices, but large-scale sample labeling is time-consuming and labor-intensive, making it difficult to implement in real-world applications. The noise residuals generated by different camera sensors exhibit statistical asymmetry, and the structured patterns within these residuals also show local symmetric relationships. Together, these features form the theoretical foundation for camera source identification. To address the problem of limited labeled data under few-shot conditions, this paper proposes a Cross-correlation Guided Augmentation and Prediction with Hybrid Bidirectional State-Space Model Attention (CGAP-HBSA) framework built on this symmetry-related foundation. The method extracts symmetric correlation structures from unlabeled samples and converts them into reliable pseudo-labeled samples. Furthermore, the HBSA network jointly models symmetric structures and asymmetric variations in camera fingerprints using a bidirectional SSM module and a hybrid attention mechanism, thereby enhancing long-range spatial modeling capabilities and recognition robustness. On the Dresden dataset, the proposed method's 5-shot identification accuracy is only 0.02% below that of MDM-CPS, the current best-performing few-shot method, while outperforming other classical few-shot camera source identification methods; in the 10-shot task, it improves on MDM-CPS by at least 0.3%. On the Vision dataset, it improves 5-shot identification accuracy by at least 6% over MDM-CPS and 10-shot accuracy by at least 3%. Experimental results demonstrate that the proposed method achieves competitive or superior performance in both 5-shot and 10-shot settings, and additional robustness experiments confirm that the HBSA network maintains strong performance even under image compression and noise contamination.
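As background for the sensor-noise cue this paper builds on, here is a hedged sketch of classic fingerprint matching: a noise residual (image minus a denoised estimate) is correlated against a per-camera fingerprint. The Gaussian denoiser and all magnitudes below are stand-ins; production pipelines use wavelet denoisers and maximum-likelihood fingerprint estimates, and the paper's CGAP-HBSA framework replaces this matching stage with learned models.

```python
# Background sketch only: noise-residual correlation for camera attribution.
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(img):
    # Residual = image minus a low-pass estimate; a crude high-pass view
    # of the sensor noise (real pipelines use wavelet denoisers).
    return img - gaussian_filter(img, sigma=1.0)

def ncc(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(1)
fp_a = rng.normal(size=(128, 128)) * 0.05   # camera A's fingerprint (stand-in)
fp_b = rng.normal(size=(128, 128)) * 0.05   # camera B's fingerprint (stand-in)
img = rng.normal(size=(128, 128)) + fp_a    # a photo taken with camera A

res = noise_residual(img)
print(ncc(res, fp_a), ncc(res, fp_b))       # same-camera score >> cross-camera score
```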
- New
- Research Article
- 10.22266/ijies2025.1231.65
- Dec 31, 2025
- International Journal of Intelligent Engineering and Systems
Hybrid OCR with LLM-Enhanced Post-Processing for Robust Text Recognition under Extreme Illumination Conditions
- New
- Research Article
- 10.3390/s26010241
- Dec 30, 2025
- Sensors (Basel, Switzerland)
- Aki Shigesawa + 7 more
In snowy regions, road surface conditions change due to snowfall or ice formation in winter, which can create very dangerous driving situations. Recognizing road surface conditions is therefore important for both drivers and road managers. Road surface classification using in-vehicle cameras faces challenges due to the diverse environments in which vehicles operate, and it is difficult to build a single classification model that can handle all conditions. One major challenge is illumination: during dusk it changes rapidly and drastically, degrading classification accuracy. A robust method is therefore needed to accurately recognize road conditions at all times. In this study, we used an image translation method to standardize illumination conditions. Next, we extracted features from both the translated images and the original images using MobileNet. Finally, we integrated these features using Late Fusion with an Extreme Learning Machine to classify road conditions. The effectiveness of this method was verified using a dataset of in-vehicle camera images. The results showed that the method achieved 78% accuracy during dusk and outperformed the comparison methods, confirming that standardizing illumination conditions contributed to the improvement in classification accuracy. The proposed method can classify road conditions even during dusk, when sudden changes in illumination occur, demonstrating the potential of a robust road condition recognition method that contributes to improved driver safety and efficient road management.
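The fusion stage described above can be shown compactly: in a typical late-fusion-with-ELM setup, embeddings from the two image branches are concatenated, passed through a fixed random hidden layer, and the output weights are solved in closed form. This is a minimal sketch under our own assumptions; feature dimensions, the hidden width, and the ridge term are not from the paper, and random arrays stand in for real MobileNet embeddings.

```python
# Hedged sketch of late fusion + Extreme Learning Machine (ELM).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 1280                        # samples, per-branch embedding size (assumed)
f_orig = rng.normal(size=(n, d))        # stand-in for MobileNet features (original images)
f_trans = rng.normal(size=(n, d))       # stand-in for features of translated images
y = rng.integers(0, 4, size=n)          # e.g. dry / wet / snowy / icy

X = np.hstack([f_orig, f_trans])        # late fusion by concatenation
Y = np.eye(4)[y]                        # one-hot targets

h = 512                                 # hidden units (assumed)
W = rng.normal(size=(X.shape[1], h))    # random projection, never trained
b = rng.normal(size=h)
H = np.tanh(X @ W + b)                  # hidden activations

# Output weights via ridge-regularized least squares (the only "learning" in ELM)
lam = 1e-2
beta = np.linalg.solve(H.T @ H + lam * np.eye(h), H.T @ Y)

pred = (H @ beta).argmax(axis=1)
print("train accuracy:", (pred == y).mean())
```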
- New
- Research Article
- 10.71097/ijsat.v16.i4.10026
- Dec 29, 2025
- International Journal on Science and Technology
- Samadhan Ghodke + 3 more
Face recognition in unconstrained environments remains challenging due to occlusion, pose variations, illumination changes, and unreliable face alignment. This paper presents MSAP-Net, a hierarchical multi-scale adaptive preprocessing framework designed to enhance face recognition robustness under such conditions. The proposed method integrates color space normalization, adaptive face detection with intelligent upsampling, context-aware padding, landmark confidence estimation, and confidence-weighted face alignment prior to deep feature extraction. Unlike fixed preprocessing pipelines, MSAP-Net applies selective and adaptive preprocessing to preserve discriminative facial features and avoid feature degradation. Experimental evaluation on unconstrained face datasets demonstrates that refining landmark detection and preprocessing significantly improves verification performance, achieving a 7% increase in accuracy and a 10% improvement in AUC, with a corresponding reduction in equal error rate. The results confirm that adaptive preprocessing and reliable alignment play a crucial role in improving recognition robustness, particularly for face verification tasks. While identification performance remains limited due to feature discriminability constraints, MSAP-Net provides a practical and extensible foundation for robust, edge-deployable face recognition systems.
- New
- Research Article
- 10.3390/bdcc10010011
- Dec 29, 2025
- Big Data and Cognitive Computing
- Saksham Singla + 6 more
Real-time fine-grained human activity recognition (HAR) remains a challenging problem due to rapid spatial–temporal variations, subtle motion differences, and dynamic environmental conditions. Addressing this difficulty, we propose NovAc-DL, a unified deep learning framework designed to accurately classify short human-like actions, specifically, “pour” and “stir” from sequential video data. The framework integrates adaptive time-distributed convolutional encoding with temporal reasoning modules to enable robust recognition under realistic robotic-interaction conditions. A balanced dataset of 2000 videos was curated and processed through a consistent spatiotemporal pipeline. Three architectures, LRCN, CNN-TD, and ConvLSTM, were systematically evaluated. CNN-TD achieved the best performance, reaching 98.68% accuracy with the lowest test loss (0.0236), outperforming the other models in convergence speed, generalization, and computational efficiency. Grad-CAM visualizations further confirm that NovAc-DL reliably attends to motion-salient regions relevant to pouring and stirring gestures. These results establish NovAc-DL as a high-precision real-time-capable solution for deployment in healthcare monitoring, industrial automation, and collaborative robotics.
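For concreteness, here is a hedged sketch of the time-distributed CNN idea behind the best-performing CNN-TD variant: one 2D convolutional encoder is shared across all frames of a clip, and a light temporal head summarizes the per-frame embeddings. The GRU head, layer sizes, and clip shape are our assumptions, not details from the paper.

```python
# Minimal time-distributed CNN sketch (assumptions throughout).
import torch
import torch.nn as nn

class CNNTD(nn.Module):
    def __init__(self, n_classes=2, emb=128):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb),
        )
        self.temporal = nn.GRU(emb, emb, batch_first=True)  # one plausible temporal head
        self.head = nn.Linear(emb, n_classes)

    def forward(self, clips):                 # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)          # (B*T, C, H, W): same encoder for every frame
        feats = self.frame_encoder(frames).view(b, t, -1)
        _, h = self.temporal(feats)           # last hidden state summarizes the clip
        return self.head(h[-1])               # logits, e.g. "pour" vs "stir"

logits = CNNTD()(torch.randn(4, 16, 3, 112, 112))   # 4 clips of 16 frames each
```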
- New
- Research Article
- 10.1038/s42003-025-09411-y
- Dec 27, 2025
- Communications biology
- Yiyuan Zhang + 2 more
Balancing specificity and generalization in object recognition is a significant challenge for biological and artificial visual systems. Here, we investigated how the brain addresses this challenge by examining the relationship between interconnectivity of neural networks, dimensionality of neural space, and levels of abstraction in representing objects, employing combined neurophysiological data from macaques and computational modeling. We found that higher interconnectivity within area TEa of macaques' inferior temporal (IT) cortex was associated with lower dimensionality and greater generalization, while lower interconnectivity within area TEO correlated with higher dimensionality and greater specificity. To establish a causal link, we developed a brain-inspired computational model constrained by empirical wiring length. This structured interconnectivity created optimal dimensionality of the neural space, facilitating efficient energy distribution across the representational manifold embedded within the neural space, balancing specificity and generalization. Our findings underscore the critical role of structured connectivity in enabling robust object recognition through multi-level abstraction.
- New
- Research Article
- 10.3390/bioengineering13010029
- Dec 26, 2025
- Bioengineering
- Hyoung-Gook Kim + 1 more
Emotion recognition based on EEG signals remains a challenging task due to the complex spatiotemporal properties of brain activity and substantial intersubject variability. To address these challenges, we propose the EED-CL framework, which integrates an extended EEG-Deformer (EED) with contrastive learning (CL). The proposed model incorporates a depthwise separable convolution encoder for efficient extraction of spatiotemporal EEG features, a hierarchical coarse-to-medium-to-fine (HCMFT) transformer to capture multiscale temporal patterns, and an attentive dense information purification (ADIP) module to suppress noise and refine essential latent representations. In addition, CL-based pretraining facilitates robust feature learning even in settings with limited labeled data. The extracted multiscale features are integrated and classified through a Transformer encoder and an MLP. Experiments conducted on multiple benchmark EEG datasets show that EED consistently outperforms conventional models, while EED-CL achieves further improvements under label-constrained conditions. Notably, EED-CL demonstrates strong robustness to intersubject variability and noise, enabling stable emotion classification even when labeled samples are scarce. These findings indicate that EED-CL effectively captures multiscale spatiotemporal EEG patterns and offers a scalable and reliable approach for EEG-based emotion recognition.
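The "CL-based pretraining" credited above for label efficiency is commonly an NT-Xent (SimCLR-style) objective; the sketch below shows that loss on paired augmented views of the same EEG windows. Treating the paper's objective as NT-Xent is our assumption; the authors' exact contrastive formulation may differ.

```python
# NT-Xent contrastive loss sketch for paired EEG-window embeddings.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same windows."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # (2N, D), unit norm
    sim = z @ z.t() / tau                             # temperature-scaled cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                 # a sample never matches itself
    targets = torch.arange(2 * n, device=z.device)
    targets = (targets + n) % (2 * n)                 # positive = the other view of the same window
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 64), torch.randn(8, 64))
```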
- New
- Research Article
- 10.1371/journal.pone.0339277
- Dec 26, 2025
- PLOS One
- Liu Wenbo
Visual design element recognition and analysis play a critical role in various applications, ranging from creative design to cultural artifact preservation. However, existing methods often struggle with accurately identifying and understanding complex, multimodal design elements in real-world scenarios. To address this, we propose an integrated model that combines the Swin Transformer for precise image segmentation, multi-scale feature fusion for robust type recognition, and a multimodal large language model (LLM) for fine-grained image understanding. Experimental results on ETHZ Shape Classes, ImageNet, and COCO datasets demonstrate that the proposed model outperforms state-of-the-art methods, achieving 88.6% segmentation accuracy and a 92.3% F1 score in multimodal tasks. These findings highlight the model’s potential as an effective tool for advanced design element recognition and analysis. The source code for this study can be viewed at this url: https://github.com/LIU-WENBO/Multi-Feature-Design-Elements-Recognition.
- New
- Research Article
- 10.1002/adma.202520823
- Dec 26, 2025
- Advanced materials (Deerfield Beach, Fla.)
- Jie Liu + 3 more
Machine vision systems face significant challenges in accurately extracting critical features from dim objects under complex scenarios. Here, we demonstrate a ferroelectric-configured weight-reconfigurable photovoltaic device array for in-sensor dynamic computing, enabling robust recognition of dim objects. A series of 2D perovskite ferroelectric nanoplates with controllable size, high crystallinity, and excellent yield are directly synthesized. Reconfigurable and nonvolatile photovoltaics in a graphene/ferroelectric/graphene heterostructure are modulated through switchable ferroelectric polarization. Leveraging the ferroelectric-configured photoresponsivity, a convolution kernel optoelectronic sensor array with dynamic correlation of adjacent units is designed for in-sensor dynamic computing. Compared with traditional static optoelectronic convolution processing, our approach selectively amplifies subtle differences of local image pixels, enabling effective edge feature extraction even in low-contrast scenes. Integrated with a convolutional neural network, the system significantly enhances the robustness and accuracy of dim object detection, offering a promising platform for advanced machine vision applications.
- New
- Research Article
- 10.5120/ijca2025926133
- Dec 24, 2025
- International Journal of Computer Applications
- V Vathsala + 1 more
A Comprehensive Study on Integration of Segmentation and Enhancement Approaches for Robust Finger Vein Recognition
- New
- Research Article
- 10.3390/sym18010015
- Dec 21, 2025
- Symmetry
- Katherine Lin Shu + 1 more
Facial expression recognition (FER) is a key task in affective computing and human–computer interaction, aiming to decode facial muscle movements into emotional categories. Although deep learning-based FER has achieved remarkable progress, robust recognition under uncontrolled conditions (e.g., illumination change, pose variation, occlusion, and cultural diversity) remains challenging. Traditional Convolutional Neural Networks (CNNs) are effective at local feature extraction but limited in modeling global dependencies, while Vision Transformers (ViT) provide global context modeling yet often neglect fine-grained texture and frequency cues that are critical for subtle expression discrimination. Moreover, existing approaches usually focus on single-domain representations and lack adaptive strategies to integrate heterogeneous cues across spatial, semantic, and spectral domains, leading to limited cross-domain generalization. To address these limitations, this study proposes a unified Multi-Domain Feature Enhancement and Fusion (MDFEFT) framework that combines a ViT-based global encoder with three complementary branches—channel, spatial, and frequency—for comprehensive feature learning. Taking into account the approximately bilateral symmetry of human faces and the asymmetric distortions introduced by pose, occlusion, and illumination, the proposed MDFEFT framework is designed to learn symmetry-aware and asymmetry-robust representations for facial expression recognition across diverse domains. An adaptive Cross-Domain Feature Enhancement and Fusion (CDFEF) module is further introduced to align and integrate heterogeneous features, achieving domain-consistent and illumination-robust expression understanding. The experimental results show that the proposed method consistently outperforms existing CNN-, Transformer-, and ensemble-based models. The proposed model achieves accuracies of 0.997, 0.796, and 0.776 on KDEF, FER2013, and RAF-DB, respectively. Compared with the strongest baselines, it further improves accuracy by 0.3%, 2.2%, and 1.9%, while also providing higher F1-scores and better robustness in cross-domain testing. These results confirm the effectiveness and strong generalization ability of the proposed framework for real-world facial expression recognition.
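Of the three branches named above, the frequency branch is the least standard, so here is a speculative sketch of one plausible form: the log-amplitude 2D spectrum of the face crop is fed through a small CNN, producing features to be fused with the spatial, channel, and ViT streams. This is purely illustrative and not taken from the paper.

```python
# Speculative frequency-branch sketch (our assumption, not the MDFEFT code).
import torch
import torch.nn as nn

class FrequencyBranch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        spec = torch.fft.fft2(x)                          # per-channel 2D FFT
        amp = torch.log1p(spec.abs())                     # log-amplitude spectrum
        amp = torch.fft.fftshift(amp, dim=(-2, -1))       # center low frequencies
        return self.net(amp)

feat = FrequencyBranch()(torch.randn(2, 3, 112, 112))    # (2, 64) frequency features
```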
- New
- Research Article
- 10.1038/s41598-025-31558-1
- Dec 21, 2025
- Scientific reports
- Jing Hao + 1 more
Sign language serves as a crucial mode of communication for the deaf and hard-of-hearing communities, enabling effective interaction in daily life. With growing advancements in Artificial Intelligence (AI) and computer vision, there has been a significant shift toward automating sign language recognition (SLR), making communication more accessible and inclusive. Traditional AI-based approaches, such as rule-based and statistical models, struggle to handle complex hand gestures, varying lighting conditions, and occlusions. Deep learning-based methods, particularly Convolutional Neural Networks (CNNs), have improved recognition capabilities, but they often fail to capture the intricate spatial and temporal dependencies that are essential for accurate classification. To address these limitations, vision transformers (ViTs) have emerged as a breakthrough technology, offering superior feature extraction through self-attention mechanisms. Unlike conventional CNNs, ViTs efficiently model long-range dependencies, enabling robust sign recognition. This study proposes a Convolutional Vision Transformer (CvT)-based model that integrates hierarchical convolutional tokenization with transformer-based attention mechanisms, optimizing both local and global feature extraction. The proposed CvT model was evaluated on a publicly available sign language digits dataset (1,712 images across 10 classes) and an alphabet-and-symbol dataset (87,000 images across 29 classes). Empirical results on both datasets indicate that the proposed CvT model outperforms baseline models, achieving the highest accuracy of 99% and surpassing traditional CNN and transformer-based BEiT models. The findings demonstrate that CvT effectively reduces misclassifications, improves predictive confidence, and enhances generalization across training, validation, and test sets.
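The "hierarchical convolutional tokenization" at the heart of CvT can be shown compactly: a strided convolution produces overlapping patch tokens that a standard transformer encoder then attends over, in contrast to ViT's non-overlapping linear patch projection. A minimal sketch follows; all dimensions are illustrative, not the paper's.

```python
# Convolutional tokenization sketch in the CvT style (dimensions assumed).
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        # Overlapping patches (stride < kernel) preserve local structure.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.proj(x)                       # (B, dim, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)       # (B, tokens, dim)
        return self.norm(x)

tokens = ConvTokenizer()(torch.randn(2, 3, 64, 64))     # (2, 256, 64)
enc = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
out = enc(tokens)                                       # self-attention over conv tokens
```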
- Research Article
- 10.3390/s26010029
- Dec 19, 2025
- Sensors (Basel, Switzerland)
- Xuanhe Liu + 4 more
Accurate ship target recognition remains challenging in complex maritime environments due to background clutter, multiscale target appearance, and limited discriminative features extracted by single-type networks. To address these issues, this paper proposes a hierarchical local-global feature fusion network (HLGF-Net) that integrates local structural cues from a CNN encoder with global semantic dependencies modeled by a Transformer. The proposed model progressively constructs hierarchical dependencies through stacked Transformer blocks, enabling comprehensive integration of local structural details and global semantic context. This design enhances the capability to capture fine-grained local contours and long-range global contextual relationships simultaneously. Extensive experiments on ship recognition datasets demonstrate that HLGF-Net achieves superior performance compared with traditional CNNs, pure Transformers, and representative recent vision architectures, particularly under conditions of cluttered backgrounds, partial occlusion, and limited target samples. The proposed framework provides an effective solution for robust maritime target recognition and offers a general strategy for hierarchical local-global feature integration.
- Research Article
- 10.3390/technologies14010003
- Dec 19, 2025
- Technologies
- Lekshmi Chandrika Reghunath + 3 more
Identifying instruments in polyphonic audio is challenging due to overlapping spectra and variations in timbre and playing styles. This task is central to music information retrieval, with applications in transcription, recommendation, and indexing. We propose a dual-branch Convolutional Neural Network (CNN) that processes Mel-spectrograms and binary pitch masks, fused through a cross-attention mechanism to emphasize pitch-salient regions. On the IRMAS dataset, the model achieves competitive performance with state-of-the-art methods, reaching a micro F1 of 0.64 and a macro F1 of 0.57 with only 0.878M parameters. Ablation studies and t-SNE analyses further highlight the benefits of cross-modal attention for robust predominant instrument recognition.
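The cross-attention fusion is the paper's key mechanism, so here is a minimal sketch of the idea, with our own dimensions and random tensors standing in for the real CNN branches: features from the pitch-mask branch act as queries that re-weight Mel-spectrogram features toward pitch-salient regions.

```python
# Cross-attention fusion sketch (our assumptions, not the released model).
import torch
import torch.nn as nn

d = 128
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

mel_feats = torch.randn(2, 250, d)     # stand-in CNN features from Mel-spectrogram frames
pitch_feats = torch.randn(2, 250, d)   # stand-in CNN features from the binary pitch mask

# Queries from the pitch branch attend over the spectrogram branch,
# emphasizing time-frequency regions where pitch is salient.
fused, weights = attn(query=pitch_feats, key=mel_feats, value=mel_feats)
clip_embedding = fused.mean(dim=1)     # pool over time for clip-level instrument tagging
```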