Articles published on Multi-scale Features
- Research Article
- 10.3390/s25216598
- Oct 27, 2025
- Sensors
- Fang Wan + 5 more
Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable performance in scene reconstruction and novel view synthesis on benchmark datasets. However, real-world images are frequently affected by degradations such as camera shake, object motion, and lens defocus, which not only compromise image quality but also severely hinder the accuracy of 3D reconstruction—particularly in fine details. While existing deblurring approaches have made progress, most are limited to addressing a single type of blur, rendering them inadequate for complex scenarios involving multiple blur sources and resolution degradation. To address these challenges, we propose Gaussian Splatting with Multi-Scale Deblurring and Resolution Enhancement (GS-MSDR), a novel framework that seamlessly integrates multi-scale deblurring and resolution enhancement. At its core, our Multi-scale Adaptive Attention Network (MAAN) fuses multi-scale features to enhance image information, while the Multi-modal Context Adapter (MCA) and adaptive spatial pooling modules further refine feature representation, facilitating the recovery of fine details in degraded regions. Additionally, our Hierarchical Progressive Kernel Optimization (HPKO) method mitigates ambiguity and ensures precise detail reconstruction through layer-wise optimization. Extensive experiments demonstrate that GS-MSDR consistently outperforms state-of-the-art methods under diverse degraded scenarios, achieving superior deblurring, accurate 3D reconstruction, and efficient rendering within the 3DGS framework.
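The multi-scale fusion at the heart of MAAN can be illustrated with a toy sketch: upsample a coarse (context) feature map to the fine (detail) resolution and blend the two scales with per-scale weights. This is a hypothetical, greatly simplified stand-in — the fixed weights, 1-D features, and nearest-neighbour upsampling are illustrative assumptions, not the authors' implementation.

```python
def upsample_nearest(coarse, factor=2):
    """Nearest-neighbour upsampling: repeat each coarse value `factor` times."""
    return [v for v in coarse for _ in range(factor)]

def fuse_scales(fine, coarse, w_fine=0.7, w_coarse=0.3):
    """Blend fine detail with upsampled coarse context using fixed weights."""
    up = upsample_nearest(coarse, len(fine) // len(coarse))
    return [w_fine * f + w_coarse * c for f, c in zip(fine, up)]

fine = [0.0, 1.0, 0.0, 1.0]   # high-frequency detail at full resolution
coarse = [0.5, 0.5]           # low-frequency context at half resolution
print(fuse_scales(fine, coarse))
```

A real MAAN-style module would learn the per-scale weights adaptively per position rather than fixing them globally.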
- Research Article
- 10.3390/s25216603
- Oct 27, 2025
- Sensors
- Yuduo Lin + 3 more
Object detection in challenging imaging domains like security screening, medical analysis, and satellite imaging is often hindered by signal degradation (e.g., noise, blur) and spatial ambiguity (e.g., occlusion, extreme scale variation). We argue that many standard architectures fail by fusing multi-scale features prematurely, which amplifies noise. This paper introduces the Enhance-Fuse-Align (E-F-A) principle: a new architectural blueprint positing that robust feature enhancement and explicit spatial alignment are necessary preconditions for effective feature fusion. We implement this blueprint in a model named SecureDet, which instantiates each stage: (1) an RFCBAMConv module for feature Enhancement; (2) a BiFPN for weighted Fusion; (3) ECFA and ASFA modules for contextual and spatial Alignment. To validate the E-F-A blueprint, we apply SecureDet to the highly challenging task of X-ray contraband detection. Extensive experiments and ablation studies demonstrate that the mandated E-F-A sequence is critical to performance, significantly outperforming both the baseline and incomplete or improperly ordered architectures. In practice, enhancement is applied prior to fusion to attenuate noise and blur that would otherwise be amplified by cross-scale aggregation, and final alignment corrects mis-registrations to avoid sampling extraneous signals from occluding materials.
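The ordering argument above — enhance before fusing, or cross-scale aggregation amplifies noise — can be shown with a toy example. This is a hypothetical sketch (1-D features, a median filter standing in for enhancement, element-wise sum standing in for weighted fusion), not the SecureDet code.

```python
def enhance(feature):
    """3-tap median filter (edge-padded) as a stand-in for feature enhancement."""
    padded = [feature[0]] + feature + [feature[-1]]
    return [sorted(padded[i:i + 3])[1] for i in range(len(feature))]

def fuse(scale_a, scale_b):
    """Element-wise sum as a stand-in for weighted multi-scale fusion."""
    return [a + b for a, b in zip(scale_a, scale_b)]

clean = [1.0, 1.0, 1.0, 1.0]
noisy = [1.0, 1.0, 9.0, 1.0]   # one corrupted activation

fused_raw = fuse(clean, noisy)                    # fusion propagates the spike
fused_enh = fuse(enhance(clean), enhance(noisy))  # spike removed before fusion

print(max(fused_raw), max(fused_enh))
```

Fusing first carries the noise spike into the aggregated features; enhancing each scale first removes it — the same reasoning E-F-A applies at the architectural level.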
- Research Article
- 10.1177/03611981251367708
- Oct 27, 2025
- Transportation Research Record: Journal of the Transportation Research Board
- Lingkun Chen + 4 more
Utilizing a crack segmentation model based on a convolutional neural network (CNN) and Transformer for crack recognition has recently been a focal point in research on road damage identification. However, because of the limited global-information processing capability of CNN models and the inadequate local-feature recognition ability of Transformer models, model performance in crack recognition under complex environments is suboptimal. At the same time, large model parameter counts and low computational efficiency impede progress in crack recognition tasks. To address these issues, this paper proposes a framework named Parallel Flatten Swin-VanillaNet (PFSV), which integrates the Flatten Swin Transformer and VanillaNet. The framework employs upsampling to extract multiscale features from the intermediate layers of the encoder for decoding. The results demonstrate that, compared with DeepLabV3+, PSPNet, FPN, SETR, SegFormer, and DeepCrack, the PFSV model achieves improvements across all evaluation metrics. In addition, the number of parameters is reduced by 35.56% to 50.19%, and frames-per-second and floating-point-operations-per-second values surpass those of the comparative models. The proposed PFSV model exhibits robust crack detection capabilities and superior computational efficiency.
- Research Article
- 10.3389/fpls.2025.1670790
- Oct 27, 2025
- Frontiers in Plant Science
- Hanyun Lu + 5 more
Introduction: Accurate fruit detection under low-visibility conditions such as fog, rain, and low illumination is crucial for intelligent orchard management and robotic harvesting. However, most existing detection models experience significant performance degradation in these visually challenging environments. Methods: This study proposes a modular detection framework named Dynamic Coding Network (DCNet), designed specifically for robust fruit detection in low-visibility agricultural scenes. DCNet comprises four main components: a Dynamic Feature Encoder for adaptive multi-scale feature extraction, a Global Attention Gate for contextual modeling, a Cross-Attention Decoder for fine-grained feature reconstruction, and an Iterative Feature Attention mechanism for progressive feature refinement. Results: Experiments on the LVScene4K dataset, which contains multiple fruit categories (grape, kiwifruit, orange, pear, pomelo, persimmon, pumpkin, and tomato) under fog, rain, low light, and occlusion conditions, demonstrate that DCNet achieves 86.5% mean average precision and 84.2% intersection over union. Compared with state-of-the-art methods, DCNet improves F1 by 3.4% and IoU by 4.3%, maintaining a real-time inference speed of 28 FPS on an RTX 3090 GPU. Discussion: The results indicate that DCNet achieves a superior balance between detection accuracy and computational efficiency, making it well-suited for real-time deployment in agricultural robotics. Its modular architecture also facilitates generalization to other crops and complex agricultural environments.
- Research Article
- 10.1088/2631-8695/ae13da
- Oct 27, 2025
- Engineering Research Express
- Kaibo Yang + 3 more
Traffic scene perception involves the interrelated tasks of 3D object detection, semantic segmentation, and depth estimation. To concurrently learn multi-scale task-general, task-specific, and cross-task complementary features, a multi-task traffic scene perception algorithm based on a visual prompter is developed. The algorithm introduces a multi-scale prompt learning method to obtain rich multi-scale feature maps and prompt words. A prompt selector module and an up-sample aggregate module are designed to process the output feature maps and prompt words effectively, aggregating the feature maps to a uniform scale. Additionally, a prompt teaching method enhances efficient multi-task learning. Experimental results on the Cityscapes-3D dataset demonstrate that the proposed algorithm outperforms current mainstream algorithms, achieving an mDS of 33.1% for 3D object detection, an mIoU of 79.9% for semantic segmentation, and an RMSE of 4.20 for depth estimation.
- Research Article
- 10.1088/2057-1976/ae13b5
- Oct 27, 2025
- Biomedical Physics & Engineering Express
- Liang Wang + 5 more
Accurate segmentation of glioblastoma (GBM), including the whole tumor (WT), tumor core (TC), and enhancing tumor (ET), from multi-modal magnetic resonance images (MRI) is essential for precise Tumor Treating Fields (TTFields) simulation. This study aims to address the challenges of this segmentation task to improve the accuracy of TTFields simulation results. We propose enhanced nnUnet (EnnUnet), a novel framework for multi-modal MRI segmentation that enhances the robust and widely-used nnUnet architecture. This advanced architecture integrates three key innovations: (1) Generalized Multi-kernel Convolution blocks are incorporated to capture multi-scale features and long-range dependencies. (2) A dual attention mechanism is employed at skip connections to refine feature fusion. (3) A novel boundary and Top-K loss is implemented for boundary-based refinement and to focus the training process on hard-to-segment pixels. The effectiveness of each enhancement was systematically evaluated through an ablation study on the BraTS 2023 dataset. The final EnnUnet model achieved superior performance, with average Dice scores of 93.52%, 92.07%, and 87.60% for the WT, TC, and ET, respectively, consistently outperforming other state-of-the-art methods. Furthermore, TTFields simulations on real patient data demonstrated that our precise segmentations yield more realistic electric field distributions compared to simplified homogeneous tumor models. The proposed EnnUnet architecture showcases promising potential for highly accurate and robust glioma segmentation. It offers a more reliable foundation for computational modeling, which is essential for enhancing the precision of TTFields treatment planning and advancing personalized therapeutic strategies for GBM patients.
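The Top-K component of the loss described in innovation (3) has a simple core: average the loss over only the K% hardest pixels so gradients concentrate on hard-to-segment regions. A minimal sketch under assumed toy values (not the EnnUnet code, which combines this with a boundary term):

```python
def topk_loss(pixel_losses, k_percent=10):
    """Mean of the top-k% largest per-pixel losses (the 'hardest' pixels)."""
    k = max(1, int(len(pixel_losses) * k_percent / 100))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k

# Mostly easy pixels (loss ~0.1) plus a few hard boundary pixels (loss ~2.0):
losses = [0.1] * 90 + [2.0] * 10
print(topk_loss(losses, k_percent=10))  # only the hard pixels contribute
```

A plain mean over these values would be 0.29, so easy pixels dominate; the Top-K mean is driven entirely by the hard boundary pixels.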
- Research Article
- 10.3390/pr13113434
- Oct 26, 2025
- Processes
- Zhongwei Zhu + 6 more
Accurate prediction of fracturing pressure is critical for operational safety and fracturing efficiency in unconventional reservoirs. Traditional physics-based models and existing deep learning architectures often struggle to capture the intense fluctuations and complex temporal dependencies observed in actual fracturing operations. To address these challenges, this paper proposes a multi-source, data-driven fracturing pressure prediction model integrating a TCN-BiLSTM-Attention architecture (Temporal Convolutional Network, Bidirectional Long Short-Term Memory, and Attention mechanism) and introduces a feature selection mechanism for fracturing pressure prediction. The model employs the TCN to extract multi-scale local fluctuation features, the BiLSTM to capture long-term dependencies, and Attention to adaptively adjust feature weights. A two-stage feature selection strategy combining correlation analysis and ablation experiments effectively eliminates redundant features and enhances model robustness. Field data from the Sichuan Basin were used for model validation. Results demonstrate that our method significantly outperforms baseline models (LSTM, BiLSTM, and TCN-BiLSTM) in mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2), particularly under high-fluctuation conditions. When integrated with slope reversal analysis, it achieves sand blockage warnings up to 41 s in advance, offering substantial potential for real-time decision support in fracturing operations.
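The correlation-analysis stage of a two-stage feature selection strategy like the one above can be sketched as a Pearson-correlation filter: drop candidate input channels whose correlation with the target pressure series falls below a threshold. The feature names, values, and the 0.5 threshold here are hypothetical illustrations, not the paper's data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(features, target, threshold=0.5):
    """Keep only features whose |correlation| with the target passes the threshold."""
    return {name: series for name, series in features.items()
            if abs(pearson(series, target)) >= threshold}

pressure = [10.0, 12.0, 15.0, 14.0, 18.0]
candidates = {
    "slurry_rate":  [1.0, 1.3, 1.7, 1.6, 2.0],       # tracks pressure closely
    "ambient_temp": [25.0, 24.0, 26.0, 25.0, 24.0],  # essentially unrelated
}
kept = select_features(candidates, pressure)
print(sorted(kept))
```

The second stage in the paper — ablation experiments — would then confirm that each surviving feature actually improves prediction before it is kept.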
- Research Article
- 10.3390/ijgi14110418
- Oct 26, 2025
- ISPRS International Journal of Geo-Information
- Wenchao Fan + 4 more
Image-based geo-localization is a challenging task that aims to determine the geographic location of a ground-level query image captured by an Unmanned Ground Vehicle (UGV) by matching it to geo-tagged nadir-view (top-down) images from an Unmanned Aerial Vehicle (UAV) stored in a reference database. The challenge comes from the perspective inconsistency between matched objects. In this work, we propose a novel metric learning scheme for hard exemplar mining to improve the performance of cross-view geo-localization. Specifically, we introduce a Dynamic Online Cross-Batch (DOCB) hard exemplar mining scheme that solves the lack of hard exemplars in mini-batches during the middle and late stages of training, which otherwise leads to training stagnation. It mines cross-batch hard negative exemplars according to the current network state and reloads them into the network so that the gradients of negative exemplars participate in back-propagation. Since the feature representation of cross-batch negative examples adapts to the current network state, the triplet loss calculation becomes more accurate. Compared with methods that only consider the gradients of anchors and positives, adding the gradients of negative exemplars helps obtain the correct gradient direction. Therefore, our DOCB scheme can better guide the network to learn valuable metric information. Moreover, we design a simple Siamese-like network called multi-scale feature aggregation (MSFA), which generates multi-scale feature aggregation by learning and fusing multiple local spatial embeddings. The experimental results demonstrate that our DOCB scheme and MSFA network achieve an accuracy of 95.78% on the CVUSA dataset and 86.34% on the CVACT_val dataset, outperforming other existing methods in the field.
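The cross-batch mining idea can be sketched in a few lines: keep a memory of embeddings from earlier mini-batches and, for each anchor, pick the closest (hardest) negative from the current batch and the memory combined. This is a hypothetical simplification of DOCB — the real scheme re-embeds mined negatives with the current network so their feature representations and gradients stay up to date.

```python
def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hardest_negative(anchor, negatives):
    """The negative closest to the anchor is the hardest exemplar."""
    return min(negatives, key=lambda n: dist(anchor, n))

def triplet_loss(anchor, positive, negative, margin=0.3):
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [0.1, 0.0]
batch_negatives = [[5.0, 5.0]]    # easy negative: triplet loss would be 0
memory_negatives = [[0.3, 0.0]]   # hard exemplar retained from a past batch

neg = hardest_negative(anchor, batch_negatives + memory_negatives)
print(neg, triplet_loss(anchor, positive, neg))
```

With only the in-batch negative the loss vanishes and training stalls; the cross-batch hard negative keeps the loss (and gradient) non-zero.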
- Research Article
- 10.3390/jimaging11110376
- Oct 26, 2025
- Journal of Imaging
- Zhongmin Liu + 2 more
To address global semantic loss, local detail blurring, and spatial–semantic conflict during image restoration under adverse weather conditions, we propose an image restoration network that integrates Mamba with Transformer architectures. We first design a Vision-Mamba–Transformer (VMT) module that combines the long-range dependency modeling of Vision Mamba with the global contextual reasoning of Transformers, facilitating the joint modeling of global structures and local details, thus mitigating information loss and detail blurring during restoration. Second, we introduce an Adaptive Content Guidance (ACG) module that employs dynamic gating and spatial–channel attention to enable effective inter-layer feature fusion, thereby enhancing cross-layer semantic consistency. Finally, we embed the VMT and ACG modules into a U-Net backbone, achieving efficient integration of multi-scale feature modeling and cross-layer fusion, significantly improving reconstruction quality under complex weather conditions. The experimental results show that on Snow100K-S/L, VMT-Net improves PSNR over the baseline by approximately 0.89 dB and 0.36 dB, with SSIM gains of about 0.91% and 0.11%, respectively. On Outdoor-Rain and Raindrop, it performs similarly to the baseline and exhibits superior detail recovery in real-world scenes. Overall, the method demonstrates robustness and strong detail restoration across diverse adverse-weather conditions.
- Research Article
- 10.36001/phmconf.2025.v17i1.4310
- Oct 26, 2025
- Annual Conference of the PHM Society
- Munsu Jeon + 3 more
This study proposes a novel autonomous inspection strategy for insulator strings using a drone. The proposed method not only optimizes the viewpoint of an optical camera for acquiring high-quality images of insulator strings but also detects anomalies of insulator strings from the acquired images. The proposed method features three key characteristics. First, an adaptive flight strategy is proposed based on the spatial configuration of transmission facilities. Specifically, the type of transmission tower is classified as either suspension or strain by analyzing the orientation of the insulator strings detected from optical images. Key structural features of transmission facilities are then extracted from point cloud data by applying effective signal processing methods, including random sample consensus, Euclidean distance clustering, and probabilistic downsampling. This enables the drone to dynamically adjust its heading, altitude, and camera tilt to acquire optimal images of insulator strings. Second, a novel deep neural network architecture is proposed to detect defects in insulator strings based on the acquired images. Specifically, the proposed network combines a multi-scale variational autoencoder and a lightweight classifier for anomaly detection. The variational autoencoder reconstructs normal insulator images at multiple scales to acquire hierarchical features, and the classifier distinguishes between normal and defective patterns by utilizing the extracted multi-scale features. Third, synthetic images of insulator strings are generated to mitigate the data imbalance between normal and abnormal images. Specifically, 3D models of insulator strings are constructed using computer-aided design tools, and fault patterns are embedded to generate abnormal samples. 2D synthetic images are then rendered under varying viewpoints, lighting conditions, and backgrounds.
Additionally, a generative adversarial network is employed to produce realistic defect images, enhancing the diversity of abnormal samples. These synthetic images contribute to improving the robustness of the proposed anomaly detection network. Systematic analyses conducted in both virtual and real-world environments show the effectiveness of the proposed method. The adaptive flight mission was successfully completed, acquiring high-quality images of insulator strings without visual overlap between adjacent strings. The proposed network achieves a classification accuracy of 95.0% in distinguishing between normal and abnormal insulator strings. The proposed strategy not only improves the performance of autonomous inspection but also enhances operational safety by reducing reliance on manual inspection in hazardous environments.
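The multi-scale reconstruction idea behind the autoencoder-based detector above can be sketched as an anomaly score: compare input and reconstruction at several resolutions (here, repeated 2x average-pooling of a 1-D signal) and sum the per-scale errors. A hypothetical stand-in for the multi-scale variational autoencoder, using toy signals in place of images.

```python
def downsample(signal):
    """2x average pooling: average neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def multiscale_score(original, reconstruction, scales=3):
    """Sum reconstruction error over progressively coarser scales."""
    score = 0.0
    for _ in range(scales):
        score += mse(original, reconstruction)
        original, reconstruction = downsample(original), downsample(reconstruction)
    return score

normal = [1.0] * 8
defect = [1.0, 1.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # one local anomaly
recon = [1.0] * 8   # an autoencoder trained on normals reproduces "normal"

print(multiscale_score(normal, recon), multiscale_score(defect, recon))
```

Normal inputs reconstruct well at every scale (score 0), while a local defect leaves a residual at several scales, which is what the downstream classifier exploits.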
- Research Article
- 10.7507/1001-5515.202412012
- Oct 25, 2025
- Sheng wu yi xue gong cheng xue za zhi = Journal of biomedical engineering = Shengwu yixue gongchengxue zazhi
- Wen Guo + 4 more
Colorectal polyps are important early markers of colorectal cancer, and their early detection is crucial for cancer prevention. Although existing polyp segmentation models have achieved certain results, they still face challenges such as diverse polyp morphology, blurred boundaries, and insufficient feature extraction. To address these issues, this study proposes a parallel coordinate fusion network (PCFNet), aiming to improve the accuracy and robustness of polyp segmentation. PCFNet integrates parallel convolutional modules and a coordinate attention mechanism, enabling the preservation of global feature information while precisely capturing detailed features, thereby effectively segmenting polyps with complex boundaries. Experimental results on Kvasir-SEG and CVC-ClinicDB demonstrate the outstanding performance of PCFNet across multiple metrics. Specifically, on the Kvasir-SEG dataset, PCFNet achieved an F1-score of 0.8974 and a mean intersection over union (mIoU) of 0.8358; on the CVC-ClinicDB dataset, it attained an F1-score of 0.9398 and an mIoU of 0.8923. Compared with other methods, PCFNet shows significant improvements across all performance metrics, particularly in multi-scale feature fusion and spatial information capture, demonstrating its innovativeness. The proposed method provides a more reliable AI-assisted diagnostic tool for early colorectal cancer screening.
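The coordinate attention mechanism mentioned above pools features along each spatial axis separately, so attention weights retain positional information per row and per column. A toy single-channel sketch (hypothetical; real coordinate attention inserts learned convolutions between pooling and the sigmoid, which are omitted here):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def coordinate_attention(fmap):
    """Reweight each position by a per-row and a per-column attention weight."""
    h, w = len(fmap), len(fmap[0])
    row_w = [sigmoid(sum(row) / w) for row in fmap]                  # pool over width
    col_w = [sigmoid(sum(fmap[i][j] for i in range(h)) / h)          # pool over height
             for j in range(w)]
    return [[fmap[i][j] * row_w[i] * col_w[j] for j in range(w)] for i in range(h)]

fmap = [[0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0],
        [0.0, 0.0, 0.0]]
out = coordinate_attention(fmap)
print(out[1][1])  # the active row/column pair keeps the strongest response
```

Because the row and column descriptors are kept separate rather than collapsed to a single global vector, the mechanism can localize where along each axis the salient response sits — useful for boundary-sensitive tasks like polyp segmentation.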
- Research Article
- 10.1080/15481603.2025.2565866
- Oct 25, 2025
- GIScience & Remote Sensing
- Zhifu Zhu + 6 more
Generative adversarial networks (GANs) possess powerful image translation capabilities. They can transform images acquired from different sensors into a unified domain, effectively mitigating the incomparability problem caused by imaging discrepancies in multimodal remote sensing change detection (CD). However, existing approaches predominantly emphasize domain unification while neglecting the loss of fine-grained features inherent in the translation process, consequently compromising both image translation quality and CD accuracy. To overcome these limitations, we propose a novel texture and structure interaction guided GAN (TSIG-GAN). This network establishes interactive guidance between image texture and structural features through a carefully designed dual-stream cross encoder-decoder architecture, enabling in-depth mining of fine-grained features and significantly improving the fidelity of translated images. Furthermore, to address the spatial scale diversity and complexity of remote sensing images, we develop a multi-scale adaptive feature pyramid (MAFP) module and a contextual semantic interaction guidance (CSIG) mechanism, aiming to further strengthen the model's robust representation of fine-grained features across multiple scales and complex scenes. Specifically, the MAFP module effectively captures spatial details of targets at different resolutions by dynamically integrating multi-scale features, thereby preventing detail loss in small objects due to scale discrepancies. The CSIG mechanism achieves deep interaction between texture and structural features at the contextual semantic level, further promoting their mutual cooperation, thereby enhancing the consistency of fine-grained feature representation and semantic integrity in complex scenes. Finally, the translated fine-grained images are fed into a custom CD network to extract changes.
To evaluate the effectiveness of the proposed method, we conducted systematic experiments on five representative real-world datasets and performed comparative analysis with sixteen state-of-the-art multimodal CD methods. The experimental results demonstrate that TSIG-GAN achieves significant improvements in both image translation and CD performance, exhibiting superior fine-grained restoration capability and change identification capability.
- Research Article
- 10.1007/s11760-025-04902-1
- Oct 25, 2025
- Signal, Image and Video Processing
- Yandong Hou + 5 more
A novel multi-scale adaptive feature fusion framework for accurate and efficient bearing fault diagnosis
- Research Article
- 10.1007/s13369-025-10755-0
- Oct 25, 2025
- Arabian Journal for Science and Engineering
- Xu Zhang + 3 more
A Dual-dimensional Parallel Neural Network Integrating Multi-scale and Frequency-Domain Features for Aircraft Engine Life Prediction
- Research Article
- 10.3390/s25216574
- Oct 25, 2025
- Sensors
- Yi Liu + 5 more
Aero-engine ablation detection is a critical task in aircraft health management, yet existing rotation-based object detection methods often face challenges of high computational complexity and insufficient local feature extraction. This paper proposes an improved YOLOv11 algorithm incorporating Context-guided Large-kernel attention and Rotated detection head, called CLR-YOLOv11. The model achieves synergistic improvement in both detection efficiency and accuracy through dual structural optimization, with its innovations primarily embodied in the following three tightly coupled strategies: (1) Targeted Data Preprocessing Pipeline Design: To address challenges such as limited sample size, low overall image brightness, and noise interference, we designed an ordered data augmentation and normalization pipeline. This pipeline is not a mere stacking of techniques but strategically enhances sample diversity through geometric transformations (random flipping, rotation), hybrid augmentations (Mixup, Mosaic), and pixel-value transformations (histogram equalization, Gaussian filtering). All processed images subsequently undergo Z-Score normalization. This order-aware pipeline design effectively improves the quality, diversity, and consistency of the input data. (2) Context-Guided Feature Fusion Mechanism: To overcome the limitations of traditional Convolutional Neural Networks in modeling long-range contextual dependencies between ablation areas and surrounding structures, we replaced the original C3k2 layer with the C3K2CG module. This module adaptively fuses local textural details with global semantic information through a context-guided mechanism, enabling the model to more accurately understand the gradual boundaries and spatial context of ablation regions. 
(3) Efficiency-Oriented Large-Kernel Attention Optimization: To expand the receptive field while strictly controlling the additional computational overhead introduced by rotated detection, we replaced the C2PSA module with the C2PSLA module. By employing large-kernel decomposition and a spatial selective focusing strategy, this module significantly reduces computational load while maintaining multi-scale feature perception capability, ensuring the model meets the demands of real-time applications. Experiments on a self-built aero-engine ablation dataset demonstrate that the improved model achieves 78.5% mAP@0.5:0.95, representing a 4.2% improvement over the YOLOv11-obb model without the specialized data augmentation. This study provides an effective solution for high-precision real-time aviation inspection tasks.
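The final step of the preprocessing pipeline described in strategy (1), Z-Score normalization, rescales every image (or channel) to zero mean and unit variance so that augmented samples share a consistent input distribution. A minimal stand-alone sketch with toy pixel values:

```python
import math

def z_score(values):
    """Normalize a sequence to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

pixels = [10.0, 20.0, 30.0, 40.0]
normed = z_score(pixels)
print(normed)
```

Applying this after the geometric, hybrid, and pixel-value augmentations keeps the network's input statistics stable regardless of which augmentations fired — the "consistency" the pipeline aims for.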
- Research Article
- 10.1080/10589759.2025.2577840
- Oct 25, 2025
- Nondestructive Testing and Evaluation
- Guanghu Liu + 3 more
Accurate detection of steel-surface defects is essential for industrial quality control. However, scarcity of defective samples, pronounced morphological variability, and prohibitive annotation costs severely limit the performance of conventional supervised approaches. To address this limitation, we propose SD-INPFormer, an unsupervised anomaly detection model built upon an enhanced INPFormer architecture. By dynamically extracting intrinsic normal prototypes (INPs) from each test image, the model entirely circumvents the need for external annotations. Specifically, we propose an adaptive weighted multi-scale feature fusion mechanism that preserves defect-related cues across all encoder scales. We introduce the MambaMixer module to suppress background texture noise and to amplify subtle defect responses. A MemoryMLP is further proposed to prevent the encoding of anomalous features, while multi-scale gated cross attention is employed to extract INPs at multiple scales. In addition, we devise multi-centre prototype attention to encompass diverse defect patterns. SD-INPFormer’s excellent performance is empirically validated on the SteelAD and SeverstalAD datasets, providing a high-precision solution for automated steel-surface quality inspection. The implementation is publicly available at https://github.com/ghlerrix/SD-INPformer.
- Research Article
- 10.3390/en18215617
- Oct 25, 2025
- Energies
- Xiangdong Meng + 7 more
Driven by the rapid promotion of new energy technologies, lithium-ion batteries have found broad applications. Accurate prediction of their state of health (SOH) plays a critical role in ensuring safe and reliable battery management. This study presents a hybrid SOH prediction method for lithium-ion batteries by combining improved complete ensemble empirical mode decomposition with adaptive noise (ICEEMDAN) and a fully connected bidirectional long short-term memory network (FC-BiLSTM). ICEEMDAN is applied to extract multi-scale features and suppress noise, while the FC-BiLSTM integrates feature mapping with temporal modeling for accurate prediction. Using end-of-discharge time, charging capacity, and historical capacity averages as inputs, the method is validated on the NASA dataset and laboratory aging data. Results show RMSE values below 0.012 and over 15% improvement compared with BiLSTM-based benchmarks, highlighting the proposed method’s accuracy, robustness, and potential for online SOH prediction in electric vehicle battery management systems.
- Research Article
- 10.3390/s25216573
- Oct 25, 2025
- Sensors
- Ruizhi Zhang + 5 more
Real-time object detection in Unmanned Aerial Vehicle (UAV) imagery is critical yet challenging, requiring high accuracy amidst complex scenes with multi-scale and small objects, under stringent onboard computational constraints. While existing methods struggle to balance accuracy and efficiency, we propose RTUAV-YOLO, a family of lightweight models based on YOLOv11 tailored for UAV real-time object detection. First, to mitigate the feature imbalance and progressive information degradation of small objects in current architectures' multi-scale processing, we developed a Multi-Scale Feature Adaptive Modulation module (MSFAM) that enhances small-target feature extraction capabilities through adaptive weight generation mechanisms and dual-pathway heterogeneous feature aggregation. Second, to overcome the limitations in contextual information acquisition exhibited by current architectures in complex scene analysis, we propose a Progressive Dilated Separable Convolution Module (PDSCM) that achieves effective aggregation of multi-scale target contextual information through continuous receptive field expansion. Third, to preserve fine-grained spatial information of small objects during feature map downsampling operations, we engineered a Lightweight DownSampling Module (LDSM) to replace the traditional convolutional module. Finally, to rectify the insensitivity of current Intersection over Union (IoU) metrics toward small objects, we introduce the Minimum Point Distance Wise IoU (MPDWIoU) loss function, which enhances small-target localization precision through the integration of distance-aware penalty terms and adaptive weighting mechanisms. Comprehensive experiments on the VisDrone2019 dataset show that RTUAV-YOLO achieves an average improvement of 3.4% and 2.4% in mAP50 and mAP50-95, respectively, compared to the baseline model, while reducing the number of parameters by 65.3%.
Its generalization capability for UAV object detection is further validated on the UAVDT and UAVVaste datasets. The proposed model is deployed on a typical airborne platform, Jetson Orin Nano, providing an effective solution for real-time object detection scenarios in actual UAVs.
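The distance-aware penalty behind an MPDIoU-style loss like the MPDWIoU described above can be sketched as standard IoU plus normalized squared distances between corresponding corner points of the two boxes, which keeps a useful gradient even when small boxes barely overlap. This is a hypothetical simplification (the paper adds adaptive weighting on top); boxes are (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mpd_iou_loss(pred, gt, img_w, img_h):
    """1 - IoU plus normalized corner-point distance penalties."""
    d1 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners
    d2 = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - iou(pred, gt) + d1 / norm + d2 / norm

gt = (10.0, 10.0, 20.0, 20.0)
print(mpd_iou_loss(gt, gt, 100, 100))                        # perfect match
print(mpd_iou_loss((12.0, 12.0, 22.0, 22.0), gt, 100, 100))  # shifted box
```

For non-overlapping boxes plain 1 - IoU saturates at 1 regardless of how far apart they are, whereas the corner-distance terms still shrink as the prediction moves toward the target — the property that helps small-object localization.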
- Research Article
- 10.3390/fire8110413
- Oct 25, 2025
- Fire
- Tina Samavat + 3 more
Smoke detection is a practical approach for early identification of wildfires and mitigating hazards that affect ecosystems, infrastructure, property, and the community. The existing deep learning (DL) object detection methods (e.g., Detection Transformer (DETR)) have demonstrated significant potential for early awareness of these events. However, their precision is influenced by the low visual salience of smoke and the reliability of the annotation, and collecting real-world and reliable datasets with precise annotations is a labor-intensive and time-consuming process. To address this challenge, we propose a weakly supervised Transformer-based approach with a teacher–student architecture designed explicitly for smoke detection while reducing the need for extensive labeling efforts. In the proposed approach, an expert model serves as the teacher, guiding the student model to learn from a variety of data annotations, including bounding boxes, point labels, and unlabeled images. This adaptability reduces the dependency on exhaustive manual annotation. The proposed approach integrates a Deformable-DETR backbone with a modified loss function to enhance the detection pipeline by improving spatial reasoning, supporting multi-scale feature learning, and facilitating a deeper understanding of the global context. The experimental results demonstrate performance comparable to, and in some cases exceeding, that of fully supervised models, including DETR and YOLOv8. Moreover, this study expands the existing datasets to offer a more comprehensive resource for the research community.
- Research Article
- 10.1007/s41870-025-02790-9
- Oct 25, 2025
- International Journal of Information Technology
- Aabidah Nazir + 1 more
Multi-scale feature enhancement using EfficientNet-B7 and PANet in faster R-CNN for small object detection