Published in last 50 years
Articles published on Multi-scale Features
- New
- Research Article
- 10.54254/2755-2721/2025.ld29184
- Nov 5, 2025
- Applied and Computational Engineering
- Tiancheng Hu
In recent years, algorithms in the field of computer vision have advanced rapidly, and small object detection has become a key task in this field. Compared with medium and large targets, however, small targets cover fewer pixels and are easily disturbed by factors such as background interference, making progress more difficult. Researchers have proposed various methods to address these challenges, and the three most representative frameworks are algorithms built on YOLO, Transformer, and Diffusion models. This article provides a detailed overview and comparison of the three frameworks. YOLO-based methods excel at improving real-time detection through multi-scale feature enhancement, structural optimization, and adjustment of the loss function. Transformer-based methods improve the accuracy and precision of small-target recognition by adjusting the attention mechanism, using hybrid structures, and fusing multimodal features. Diffusion-based methods adapt the diffusion process itself, for example by constructing diffusion bounding boxes and diffusion engines. Finally, this article summarizes the advantages and limitations of these methods and discusses potential future research directions. The significance of this study lies in providing a unified overview of the three main research paradigms, helping researchers understand current progress, identify existing challenges, and explore new possibilities for advancing small object detection.
- New
- Research Article
- 10.3390/biomimetics10110743
- Nov 5, 2025
- Biomimetics
- Chunjiang Wu + 5 more
Natural gas pipeline leak monitoring suffers from severe environmental noise, non-stationary signals, and complex multi-source variable couplings, limiting prediction accuracy and robustness. Inspired by biological perceptual systems, particularly their multimodal integration and dynamic attention allocation, we propose GL-TransLSTM, a biomimetic hybrid deep learning model. It synergistically combines the Transformer’s global self-attention (emulating selective focus) and the LSTM’s gated memory (mimicking neural temporal retention). The architecture incorporates a multimodal fusion pipeline: raw sensor data are first decomposed via CEEMDAN to extract multi-scale features, then processed by an enhanced LSTM-Transformer backbone. A novel physics-informed gated attention mechanism embeds gas diffusion dynamics into attention weights, while an adaptive sliding window adjusts temporal granularity. Evaluated on an industrial dataset containing methane concentration, temperature, and pressure measurements, GL-TransLSTM achieves 99.93% accuracy, 99.86% recall, and a 99.89% F1-score, significantly outperforming conventional LSTM and Transformer-LSTM baselines. Experimental results demonstrate that the proposed biomimetic framework substantially enhances modeling capacity and generalization for non-stationary signals in noisy and complex industrial environments through multi-scale fusion, physics-guided learning, and bio-inspired architectural synergy.
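The paper's adaptive sliding window is its own contribution; as a generic illustration of the underlying windowing step that turns a raw sensor stream into sequence-model samples, a fixed-stride version can be sketched as follows (function and parameter names are ours, not the paper's):

```python
import numpy as np

def sliding_windows(signal, window, stride):
    """Split a 1-D sensor signal into overlapping windows.

    Each row of the result is one training sample for a
    sequence model such as an LSTM or Transformer encoder.
    An *adaptive* variant would vary `window`/`stride` with
    the signal's local dynamics.
    """
    n = (len(signal) - window) // stride + 1
    # Build an (n, window) index matrix, then gather in one shot.
    idx = np.arange(window)[None, :] + stride * np.arange(n)[:, None]
    return signal[idx]

# 10 readings, window of 4, stride of 2 -> 4 overlapping samples
x = np.arange(10.0)
w = sliding_windows(x, window=4, stride=2)
print(w.shape)  # (4, 4)
```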
- New
- Research Article
- 10.5194/isprs-archives-xlviii-1-w5-2025-193-2025
- Nov 5, 2025
- The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
- Cong Zhou + 5 more
Abstract. Point cloud completion, a critical task in 3D vision, aims to repair incomplete point cloud data caused by sensor limitations or environmental occlusions, thereby providing complete 3D structural information for downstream applications. Most existing methods employ global generation strategies to directly output complete point clouds, but these approaches frequently alter the original geometric structures, resulting in detail loss or increased noise. A novel attention-based multi-scale point cloud completion network is proposed to overcome these limitations. The first enhancement introduces a channel attention mechanism during multi-scale feature fusion, which strengthens the coordinated expression of local details and global semantics through adaptive weight allocation. The second improvement designs a hybrid loss function that combines Wasserstein GAN with gradient penalty and geometric consistency constraints, thereby enhancing both detail authenticity and structural coherence in generated point clouds. Experiments conducted on the ShapeNet-Part dataset demonstrate the effectiveness of the proposed method. The improved approach achieves a reduction in Chamfer Distance compared to PF-Net, with particularly enhanced robustness observed in completing complex structures such as hollow chair backs and thin-walled lampshades. These results validate the superiority of the proposed technical innovations in geometric detail preservation and structural integrity maintenance.
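The Chamfer Distance used above as the evaluation metric has a standard definition (conventions vary slightly between papers, e.g. squared vs. unsquared distances); a minimal NumPy version of the squared-distance form is:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3).

    For each point, take the squared distance to its nearest
    neighbour in the other set, then average both directions.
    """
    # (N, M) matrix of pairwise squared distances.
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0, 0], [1, 0, 0]])
b = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0]])
print(chamfer_distance(a, b))  # the extra point in b costs 1/3
```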
- New
- Research Article
- 10.5194/isprs-archives-xlviii-1-w5-2025-133-2025
- Nov 5, 2025
- The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
- Qixuan Wang + 3 more
Abstract. The use of remote sensing images for land cover classification is crucial for environmental monitoring, urban planning, and sustainable resource management. Despite advances in deep learning, existing methods suffer from blurred boundaries in complex landscapes and perform poorly in identifying small or overlapping land cover categories. This article introduces MultiTrans LC, a novel multimodal fusion framework that integrates visual-language interaction and boundary perception optimization to address these challenges. The proposed architecture utilizes a hierarchical Transformer encoder to extract global visual features from high-resolution images and aligns them with the semantic embeddings of text prompts through cross-modal attention. The visual-language decoder further refines the multi-scale feature representation through progressive fusion, while the edge-aware loss function jointly optimizes pixel-level classification and boundary localization. Experiments on three benchmark datasets (GID-15, LoveDA, RSSCN7) demonstrate state-of-the-art performance, achieving an overall accuracy of 90.7% and a Kappa coefficient of 0.901 on GID-15, which is 1.6% higher in OA than the leading method. Visualization confirms that MultiTrans LC performs well compared to CNN and Transformer baselines. By bridging visual and textual semantics, MultiTrans LC improves the accuracy of large-scale land cover mapping and provides a powerful solution for geospatial intelligence applications. The limitations and future directions of open-vocabulary classification and edge-device deployment are also discussed.
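The paper's cross-modal attention is part of a full Transformer; as a stripped-down sketch of the core operation (visual tokens as queries, text-prompt embeddings as keys and values; shapes and names are our own assumptions), single-head scaled dot-product cross-attention looks like:

```python
import numpy as np

def cross_modal_attention(vis, txt):
    """Minimal single-head scaled dot-product cross-attention.

    Visual tokens act as queries; text-prompt embeddings supply
    keys and values, so each visual token is re-expressed as a
    softmax-weighted mixture of semantic text embeddings.
    (Real implementations add learned Q/K/V projections.)
    """
    d = vis.shape[-1]
    scores = vis @ txt.T / np.sqrt(d)                 # (Nv, Nt) similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax over text tokens
    return w @ txt                                    # (Nv, d)

vis = np.random.randn(5, 8)   # 5 visual tokens, dim 8
txt = np.random.randn(3, 8)   # 3 text-prompt embeddings
out = cross_modal_attention(vis, txt)
print(out.shape)  # (5, 8)
```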
- New
- Research Article
- 10.3390/rs17213645
- Nov 5, 2025
- Remote Sensing
- Weitong Ma + 6 more
Island building change detection is a critical technology for environmental monitoring, disaster early warning, and urban planning, playing a key role in dynamic resource management and sustainable development of islands. However, the imbalanced distribution of class pixels (changed vs. unchanged) undermines the detection capability of existing methods and causes severe boundary misdetection. To address this issue, we propose the MSDT-Net model, which makes breakthroughs in architecture, modules, and loss functions: a dual-branch twin ConvNeXt architecture is adopted as the feature extraction backbone, and the designed Edge-Aware Smoothing Module (MSA) effectively enhances the continuity of the change region boundaries through a multi-scale feature fusion mechanism. The proposed Difference Feature Enhancement Module (DTEM) enables deep interaction and fusion between original semantic and change features, significantly improving the discriminative power of the features. Additionally, a Focal–Dice–IoU Boundary Joint Loss Function (FDUB-Loss) is constructed to suppress massive background interference using Focal Loss, enhance pixel-level segmentation accuracy with Dice Loss, and optimize object localization with IoU Loss. Experiments show that on a self-constructed island dataset, the model achieves an F1-score of 0.9248 and an IoU value of 0.8614. Compared to mainstream methods, MSDT-Net demonstrates significant improvements in key metrics across various aspects. Especially in scenarios with few changed pixels, the recall rate is 0.9178 and the precision is 0.9328, showing excellent detection performance and boundary integrity. The introduction of MSDT-Net provides a highly reliable technical pathway for island development monitoring.
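The exact FDUB-Loss formulation, including its boundary term and weighting scheme, is the paper's own; as a rough sketch of how its three constituent losses combine (unit weights assumed, boundary term omitted), over a soft prediction map `p` and binary ground truth `y`:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    # Down-weights easy pixels so the huge unchanged background
    # does not dominate the gradient.
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)
    return np.mean(-((1 - pt) ** gamma) * np.log(pt))

def dice_loss(p, y, eps=1e-7):
    # Overlap-based term: rewards pixel-level agreement.
    inter = np.sum(p * y)
    return 1 - (2 * inter + eps) / (np.sum(p) + np.sum(y) + eps)

def iou_loss(p, y, eps=1e-7):
    # Region-based term: penalises poor localization.
    inter = np.sum(p * y)
    union = np.sum(p) + np.sum(y) - inter
    return 1 - (inter + eps) / (union + eps)

def joint_loss(p, y, w=(1.0, 1.0, 1.0)):
    return w[0] * focal_loss(p, y) + w[1] * dice_loss(p, y) + w[2] * iou_loss(p, y)

y = np.array([0.0, 0.0, 1.0, 1.0])
print(joint_loss(y, y))      # perfect prediction -> near 0
print(joint_loss(1 - y, y))  # fully wrong prediction -> large
```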
- New
- Research Article
- 10.3390/app152111778
- Nov 5, 2025
- Applied Sciences
- Shuai Liu + 2 more
Wheat is one of the world’s essential crops, and the presence of foliar diseases significantly affects both the yield and quality of wheat. Accurate identification of wheat leaf diseases is crucial. However, traditional segmentation models face challenges such as low segmentation accuracy, limiting their effectiveness in leaf disease control. To address these issues, this study proposes MSDP-SAM2-UNet, an efficient model for wheat leaf disease segmentation. Based on the SAM2-UNet network, we achieve multi-scale feature fusion through a dual-path multi-branch architecture, enhancing the model’s ability to capture global information and thereby improving segmentation performance. Additionally, we introduce an attention mechanism to strengthen residual connections, enabling the model to precisely distinguish targets from backgrounds and achieve greater robustness and higher segmentation accuracy. The experiments demonstrate that MSDP-SAM2-UNet achieves outstanding performance across multiple metrics, including pixel accuracy (PA) of 94.02%, mean pixel accuracy (MPA) of 88.44%, mean intersection over union (MIoU) of 82.43%, frequency weighted intersection over union (FWIoU) of 90.73%, Dice coefficient of 81.76%, and precision of 81.63%. Compared to SAM2-UNet, these metrics improved by 2.04%, 2.76%, 4.1%, 2.06%, 4.9%, and 3.6%, respectively. The results validate that MSDP-SAM2-UNet has excellent segmentation performance and offers a novel perspective for wheat leaf disease segmentation.
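The PA, MPA, MIoU, and FWIoU metrics quoted above all derive from the same pixel-level confusion matrix; a compact reference computation (standard definitions, assuming every class appears in the ground truth) is:

```python
import numpy as np

def seg_metrics(pred, gt, n_classes):
    """PA, MPA, MIoU and FWIoU from a pixel-level confusion matrix.

    Assumes every class occurs at least once in `gt`, otherwise
    the per-class ratios divide by zero.
    """
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for g, p in zip(gt.ravel(), pred.ravel()):
        cm[g, p] += 1
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)
    pa = tp.sum() / cm.sum()                    # pixel accuracy
    mpa = np.mean(tp / cm.sum(axis=1))          # mean per-class accuracy
    miou = iou.mean()                           # mean IoU
    fwiou = np.sum(cm.sum(axis=1) / cm.sum() * iou)  # frequency-weighted IoU
    return pa, mpa, miou, fwiou

gt = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
print(seg_metrics(pred, gt, 2))
```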
- New
- Research Article
- 10.3390/computers14110480
- Nov 4, 2025
- Computers
- Jiayi Wang + 6 more
Currently, the increasing number of Internet of Things devices makes the shortage of spectrum resources prominent. Spectrum sensing technology can effectively alleviate this problem by monitoring the spectrum in real time. However, in practical applications it is difficult to obtain a large number of labeled samples, which leaves neural network models under-trained and degrades performance. Moreover, existing few-shot methods focus on capturing spatial features while ignoring how features are represented at different scales, thus reducing feature diversity. To address these issues, this paper proposes a few-shot spectrum sensing method based on multi-scale global features. To enhance feature diversity, the method employs a multi-scale feature extractor that extracts features at multiple scales, improving the model’s ability to distinguish signals and avoiding network overfitting. In addition, to make full use of the frequency features at different scales, a learnable-weight feature reinforcer is constructed to enhance the frequency features. The simulation results show that, for SNRs in the range of 0–10 dB, the recognition accuracy of the network exceeds 81% under all task modes, outperforming existing methods and realizing accurate spectrum sensing under few-shot conditions.
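The paper's extractor and reinforcer are learned convolutional modules; as a toy illustration of the two ideas (features at several receptive-field sizes, then a learnable softmax-weighted fusion; all names and the moving-average stand-in for convolution are our own), consider:

```python
import numpy as np

def multi_scale_features(x, scales=(3, 5, 9)):
    """Moving-average features of a 1-D signal at several window sizes.

    Each scale smooths with a different receptive field; stacking
    them gives the model several 'views' of the same sequence,
    which is the core idea of multi-scale feature extraction.
    """
    feats = []
    for k in scales:
        kernel = np.ones(k) / k
        feats.append(np.convolve(x, kernel, mode="same"))
    return np.stack(feats)  # (n_scales, len(x))

def weighted_fusion(feats, logits):
    # A 'learnable weight reinforcer' reduces here to a softmax over
    # per-scale logits (trained by gradient descent in practice).
    w = np.exp(logits) / np.exp(logits).sum()
    return (w[:, None] * feats).sum(axis=0)

x = np.sin(np.linspace(0, 6.28, 64)) + 0.1 * np.random.randn(64)
f = multi_scale_features(x)
fused = weighted_fusion(f, logits=np.array([0.0, 1.0, -1.0]))
print(f.shape, fused.shape)  # (3, 64) (64,)
```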
- New
- Research Article
- 10.3390/foods14213780
- Nov 4, 2025
- Foods
- Zhiwei Chen + 5 more
To address the challenge of monitoring the postharvest jasmine bloom stages during industrial tea scenting processes, this study proposes an efficient U-shaped Network (U-Net) model with frequency–spatial cross-attention (FSCA-EUNet) to resolve critical bottlenecks, including repetitive backgrounds and small interclass differences, caused by stacked jasmine flowers during factory production. High-resolution images of stacked jasmine flowers were first preprocessed and input into FSCA-EUNet, where the encoder extracted multi-scale spatial features and the FSCA module incorporated frequency-domain textures. The decoder then fused and refined these features, and the final classification layer output the predicted bloom stage for each image. The proposed model was designed as a U-Net-like structure to preserve multiscale details and employed a frequency–spatial cross-attention module to extract high-frequency texture features via a discrete cosine transform. Long-range dependencies were established by a NonLocalBlock located after the encoders in the model. Finally, a momentum-updated center loss function was introduced to constrain the feature space distribution and enhance intraclass compactness. According to the experimental results, the proposed model achieved the best metrics, including 95.52% precision, 95.42% recall, 95.40% F1-score, and 97.24% mean average precision, on our constructed dataset with only 878.851 K parameters and 15.445 G Floating Point Operations (FLOPs), and enabled real-time deployment at 22.33 FPS on Jetson Orin NX edge devices. The ablation experiments validated the improvements contributed by each module, which significantly improved the fine-grained classification capability of the proposed network.
In conclusion, FSCA-EUNet effectively addresses the challenges of stacked flower backgrounds and subtle interclass differences, offering a lightweight yet accurate framework that enables real-time deployment for industrial jasmine tea scenting automation.
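The discrete cosine transform mentioned above is what lets the FSCA module separate texture from smooth shading; as a self-contained sketch (our own helper names; real pipelines would use an optimized DCT), the share of energy outside the low-frequency DCT block can serve as a crude "texture" cue:

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II basis matrix of size (n, n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def high_freq_energy(patch, keep=2):
    """Share of spectral energy outside the lowest `keep` x `keep`
    DCT coefficients -- a simple stand-in for the high-frequency
    texture cues used by frequency-domain attention modules."""
    n = patch.shape[0]
    d = dct2_matrix(n)
    coeff = d @ patch @ d.T            # 2-D DCT via separability
    total = np.sum(coeff ** 2)
    low = np.sum(coeff[:keep, :keep] ** 2)
    return (total - low) / total

flat = np.ones((8, 8))                        # no texture at all
checker = np.indices((8, 8)).sum(0) % 2.0     # maximal texture
print(high_freq_energy(flat), high_freq_energy(checker))  # 0.0 and ~0.5
```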
- New
- Research Article
- 10.3390/app152111738
- Nov 4, 2025
- Applied Sciences
- Hehuan Li + 4 more
Hyperspectral images (HSIs) are crucial for ground object classification, target detection, and related applications due to their rich spatial-spectral information. However, hardware limitations in imaging systems make it challenging to directly acquire HSIs with a high spatial resolution. While deep learning-based single hyperspectral image super-resolution (SHSR) methods have made significant progress, existing approaches primarily rely on convolutional neural networks (CNNs) with fixed geometric kernels, which struggle to model global spatial-spectral dependencies effectively. To address this, we propose ESSTformer, a novel SHSR framework that synergistically integrates CNNs’ local feature extraction and Transformers’ global modeling capabilities. Specifically, we design a multi-scale spectral attention module (MSAM) based on dilated convolutions to capture local multi-scale spatial-spectral features. Considering the inherent differences between spatial and spectral information, we adopt a decoupled processing strategy by constructing separate Spatial and Spectral Transformers. The Spatial Transformer employs window attention mechanisms and an improved convolutional multi-layer perceptron (CMLP) to model long-range spatial dependencies, while the Spectral Transformer utilizes self-attention mechanisms combined with a spectral enhancement module to focus on discriminative spectral features. Extensive experiments on three hyperspectral datasets demonstrate that the proposed ESSTformer achieves superior performance in super-resolution reconstruction compared to state-of-the-art methods.
- New
- Research Article
- 10.1038/s41598-025-22419-y
- Nov 4, 2025
- Scientific Reports
- Dongren Liu + 3 more
Pediatric wrist fractures are common skeletal injuries in clinical practice; however, due to the ongoing development of children’s bones, fracture characteristics are complex and often prone to misdiagnosis or missed diagnosis. Moreover, traditional diagnostic methods rely heavily on the physician’s experience, which may compromise efficiency and accuracy, especially in environments with limited medical resources. To address this issue, this study proposes an improved deep learning detection method based on YOLO11s, named Kid-YOLO, for the automatic detection of pediatric wrist fractures in X-ray images. By introducing the C3k2-WTConv module and the Focaler-MPDIoU loss function, the model was improved in terms of multi-scale feature extraction, target box localization accuracy, and handling of the class imbalance problem. The C3k2-WTConv module, which combines wavelet transform and convolution operations, effectively enhances the model’s ability to detect subtle fractures and complex patterns. The Focaler-MPDIoU loss function improves performance in detecting rare targets by dynamically adjusting sample weight distribution and optimizing prediction box positioning. Experiments were conducted on the publicly available GRAZPEDWRI-DX dataset after data cleaning. The results show that, compared with the YOLO11 model, the improved model achieves a 3.2% increase in precision, a 1.6% increase in recall, a 1.8% improvement in mAP@50, and a 3.2% improvement in mAP@50–95. Furthermore, this study developed an AI-assisted diagnostic system with an integrated graphical user interface, capable of efficiently performing image loading, fracture detection, and result visualization, thereby providing physicians with a reliable diagnostic tool. In the future, this method is expected to be applied to a broader range of medical imaging analysis tasks, offering new technical support for precision medicine.
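The wavelet transform inside C3k2-WTConv is a learned, multi-channel operation; the simplest member of that family, a one-level 2-D Haar decomposition, illustrates why wavelets help expose subtle edges (this sketch is generic, not the paper's module):

```python
import numpy as np

def haar2d(img):
    """One level of the 2-D Haar wavelet transform.

    Splits an even-sized image into a low-frequency approximation
    (LL) and three detail bands (LH, HL, HH); the detail bands
    highlight the faint edges that a wavelet-augmented convolution
    can exploit for subtle-fracture detection.
    """
    a = (img[0::2] + img[1::2]) / 2      # average adjacent rows
    d = (img[0::2] - img[1::2]) / 2      # difference of adjacent rows
    ll = (a[:, 0::2] + a[:, 1::2]) / 2   # smooth approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2   # vertical-edge response
    hl = (d[:, 0::2] + d[:, 1::2]) / 2   # horizontal-edge response
    hh = (d[:, 0::2] - d[:, 1::2]) / 2   # diagonal detail
    return ll, lh, hl, hh

img = np.zeros((4, 4))
img[:, 1:] = 1.0          # a faint vertical step
ll, lh, hl, hh = haar2d(img)
print(ll)                 # approximation keeps coarse structure
print(lh)                 # detail band fires exactly at the step
```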
- New
- Research Article
- 10.5194/isprs-annals-x-1-w2-2025-173-2025
- Nov 4, 2025
- ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
- Zhisen Wang + 7 more
Abstract. The seedling emergence rate is a crucial indicator for evaluating the growth status of crops in agricultural production and can provide valuable recommendations for subsequent crop planting and field management strategies. Currently, the determination of the emergence rate relies on manual seedling counting, which is not only labour-intensive and time-consuming but also prone to human error. Therefore, we utilize drone-captured images of peanut seedlings and employ deep learning networks to estimate seedling numbers. Specifically, we incorporate the BIFPN (Bidirectional Feature Pyramid Network) feature fusion module into the original CenterNet model to combine multi-scale feature information. This modification not only enhances the accuracy of identification but also improves the localization of seedlings. To address the issue of false positives caused by complex field backgrounds in seedling recognition, we integrate a Contrastive Loss module to increase the discrepancy between positive and negative samples. The results demonstrate that the proposed method significantly enhances both precision and recall rates for peanut seedling recognition under three different scenes, compared to the original model. Furthermore, the proposed method is also applied in a real peanut breeding field, fulfilling the practical requirements for emergence rate calculation.
- New
- Research Article
- 10.3390/foods14213769
- Nov 3, 2025
- Foods
- Yichi Zhang + 5 more
Maize, a globally important crop, is highly susceptible to aflatoxin contamination, posing a serious threat. Therefore, accurate detection of aflatoxin levels in maize is of critical importance. In this study, the Multi-Scale Feature Network with Efficient Channel Attention (MSFNet-ECA) model, based on near-infrared hyperspectral imaging combined with deep learning techniques, was developed to analyze the content of aflatoxin B1 (AFB1) in maize. Three data augmentation methods—multiplicative random scaling, bootstrap resampling, and Wasserstein generative adversarial networks (WGAN)—were compared with various preprocessing strategies to assess their impact on model performance. Multiplicative random scaling combined with second derivative (D2) preprocessing yielded the best predictive performance for the MSFNet-ECA model. Using this augmentation, the MSFNet-ECA model outperformed four conventional models (partial least squares regression (PLSR), support vector regression (SVR), extreme learning machine (ELM), and one-dimensional convolutional neural network (1D-CNN)), achieving a root mean square error of prediction (RMSEP) of 2.3 μg·kg−1, a coefficient of determination for prediction (Rp2) of 0.99, and a residual predictive deviation (RPD) of 9, with accuracy improvements of 86.4%, 79.1%, 71.3%, and 42.5%, respectively. This finding demonstrates that applying data augmentation methods substantially improves the predictive performance of hyperspectral chemometric models driven by deep learning. Moreover, when combined with data augmentation techniques, the proposed MSFNet-ECA model can accurately predict AFB1 content in maize, offering an efficient and reliable tool for hyperspectral applications in food quality and safety monitoring.
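The RMSEP, Rp2, and RPD figures above follow standard chemometric definitions (RPD conventions vary slightly across papers, e.g. whether the error term is bias-corrected); a minimal reference computation is:

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """RMSEP, R^2 and RPD as commonly used in chemometric evaluation.

    RPD is the ratio of the reference values' standard deviation to
    the prediction error; values around 3 or above are usually read
    as a model fit for quantitative analysis.
    """
    resid = y_true - y_pred
    rmsep = np.sqrt(np.mean(resid ** 2))
    r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rpd = np.std(y_true, ddof=1) / rmsep
    return rmsep, r2, rpd

y = np.array([10.0, 20.0, 30.0, 40.0])    # reference AFB1-style values
yp = np.array([11.0, 19.0, 31.0, 39.0])   # hypothetical predictions
rmsep, r2, rpd = prediction_metrics(y, yp)
print(round(rmsep, 3), round(r2, 3), round(rpd, 3))
```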
- New
- Research Article
- 10.3390/s25216726
- Nov 3, 2025
- Sensors
- Peiquan Chen + 4 more
The real-time, precise monitoring of physiological signals such as intracranial pressure (ICP) and arterial blood pressure (BP) holds significant clinical importance. However, traditional methods like invasive ICP monitoring and invasive arterial blood pressure measurement present challenges including complex procedures, high infection risks, and difficulties in continuous measurement. Consequently, learning-based prediction utilizing observable signals (e.g., BP/pulse waves) has emerged as a crucial alternative approach. Existing models struggle to simultaneously capture multi-scale local features and long-range temporal dependencies, while their computational complexity remains prohibitively high for real-time clinical demands. To address this, this paper proposes a physiological signal prediction method combining composite feature preprocessing with multiscale modeling. First, a seven-dimensional feature matrix is constructed based on physiological prior knowledge to enhance feature discriminative power and mitigate phase mismatch issues. Second, a CNN-LSTM-Attention network architecture (CBAnet), integrating multiscale convolutions, long short-term memory (LSTM), and attention mechanisms, is designed to effectively capture both local waveform details and long-range temporal dependencies, thereby improving waveform prediction accuracy and temporal consistency. Experiments on GBIT-ABP, CHARIS, and our self-built PPG-HAF dataset show that CBAnet achieves competitive performance relative to bidirectional long short-term memory (BiLSTM), convolutional neural network-long short-term memory (CNN-LSTM), Transformer, and Wave-U-Net baselines across Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R2). This study provides a promising, efficient approach for non-invasive, continuous physiological parameter prediction.
- New
- Research Article
- 10.1371/journal.pone.0335418
- Nov 3, 2025
- PloS one
- Poonam Sharma + 4 more
Colorectal cancer (CRC) is one of the leading causes of cancer-related death and poses a significant threat to global health. Although deep learning models have been utilized to accurately diagnose CRC, they still face challenges in capturing the global correlations of spatial features, especially in complex textures and morphologically similar features. To overcome these challenges, we propose a hybrid model using a residual network and a transformer encoder with mixed attention. The Residual Next Transformer Network (RNTNet) extracts spatial features from CRC images using ResNeXt. ResNeXt utilizes group convolution and skip connections to capture fine-grained features. Furthermore, a vision transformer (ViT) encoder containing a mixed attention block is designed using multiscale feature aggregation to provide global attention to the spatial features. In addition, a Grad-CAM module is added to visualize the model's decision process to support oncologists with a second opinion. Two publicly available datasets, Kather and KvasirV1, were utilized for model training and testing. The model achieved classification accuracies of 97.96% and 98.20% on the KvasirV1 and Kather datasets, respectively. Model efficacy is further confirmed by ROC curve analysis, with AUC values of 0.9895 and 0.9937 on the KvasirV1 and Kather datasets, respectively. Comparative study findings support that RNTNet delivers improvements in accuracy and efficiency compared to state-of-the-art methods.
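The AUC values quoted above summarize the whole ROC curve in one number; it can be computed without tracing the curve at all, via its equivalence to the Mann-Whitney statistic (a standard identity, not specific to this paper):

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a randomly chosen positive
    sample is scored above a randomly chosen negative one
    (the Mann-Whitney U statistic, normalized)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score;
    # ties count as half a correct ordering.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc(scores, labels))  # 0.75
```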
- New
- Research Article
- 10.1080/10255842.2025.2584381
- Nov 3, 2025
- Computer Methods in Biomechanics and Biomedical Engineering
- Zhaoxing Xu + 4 more
This study proposes a GRU–TCN model with Temporal-Channel Attention (GT-TCA) for dose–time–concentration prediction under data scarcity and multicollinearity. TimeCVAE augments limited pharmacokinetic data with distribution-consistent sequences. GRU captures temporal dependencies, TCN extracts multi-scale features, and attention emphasizes informative time steps and analytes. Experiments on Buyang Huanwu Decoction (normal/inflammatory) and simulations (RG1678, RIF) show GT-TCA reduces MAE by 22.7% and improves R2 by 4% versus baselines (p < 0.05). Ablation confirms attention lowers MAE and RMSE by 6% and 5%. The model demonstrates robustness and provides more precise quantitative evidence to support precision dosing.
- New
- Research Article
- 10.1088/2631-8695/ae154f
- Nov 3, 2025
- Engineering Research Express
- Zhong Chen + 6 more
Self-adaptive sliding map convolution multi-scale feature fusion classification method for LiDAR point clouds of transmission lines in complex terrain environment
- New
- Research Article
- 10.1038/s41598-025-22273-y
- Nov 3, 2025
- Scientific Reports
- Hongyi Duan + 4 more
Infrared remote sensing (IRS) ship detection faces challenges such as low resolution and environmental interference, with issues being particularly pronounced for small targets. This study proposes a lightweight architecture based on RT-DETR, termed RT-DETR-CST: A Cross-Channel Feature Attention Network (CFAN) is constructed, which achieves channel-weighted feature fusion via residual connections to suppress invalid background channels, addressing the problem of inter-channel information imbalance in infrared images and the suppression of small-target features by background noise. A Scale-Wise Feature Network (SWN) is developed, utilizing depthwise separable convolutions and stochastic depth for multi-scale feature extraction, where stochastic depth enhances the model’s robustness to small-target features. A Texture/Detail Capture Network (TCN) is built, achieving edge/detail capture through linear decomposition and low-cost channel fusion to solve the problems of target edge blurring and detail feature loss in infrared images caused by low signal-to-noise ratios. Experiments on the ISDD datasets show that RT-DETR-CST achieves an mAP0.5 metric of 89.4% (a 4.9% improvement over RT-DETR), reduces model size to 23.7 MB (a 41.5% reduction), and achieves an inference speed of 207.2 FPS. Ablation experiments validate the effectiveness of each module, demonstrating the model’s superior accuracy, lightweight design, and real-time performance in infrared ship remote sensing small-target detection. Furthermore, the generalization verification on the SSDD and SIRST datasets shows that the proposed model is effective in both infrared and SAR remote sensing small target detection.
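The mAP0.5 figure above rests on a single primitive: a detection counts as correct when its IoU with a ground-truth box reaches 0.5. A minimal box-IoU reference (standard definition, corner-format boxes assumed) is:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    mAP@0.5 treats a detection as a true positive when its IoU
    with a ground-truth box is at least 0.5.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)))  # half-overlapping -> 1/3
```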
- New
- Research Article
- 10.1038/s41598-025-22177-x
- Nov 3, 2025
- Scientific Reports
- Yahao Wen + 3 more
Humans exhibit a remarkable ability to selectively focus on auditory stimuli in multi-speaker environments, such as cocktail parties. The Auditory Attention Detection (AAD) method aims to identify the conversation a listener is attending to through the analysis of neural signals, particularly electroencephalography (EEG) data. However, current methodologies in this domain face several significant limitations. While many existing AAD methods use additional information, such as spatial or frequency features, to improve decoding accuracy, they often miss the relationships between signals from different EEG channels. To address these shortcomings, this paper introduces a novel hybrid channel attention network for AAD. Our approach is the first to integrate spatial-temporal filtering, dynamic multi-scale feature fusion, and efficient cross-channel attention into a single unified architecture, enabling it to capture complex neural patterns of attention that previous methods overlooked. Our proposed network first extracts spatial-temporal features from raw EEG signals using a dedicated spatial-temporal feature extraction module. The extracted features are then processed by a module that combines information across different time scales and uses an attention mechanism to identify important relationships between EEG channels. Experimental results demonstrate that our network achieves superior classification performance compared to baseline methods, particularly under conditions with short decision windows. Notably, while maintaining exceptional accuracy, the proposed architecture significantly reduces model parameters.
- New
- Research Article
- 10.3390/s25216729
- Nov 3, 2025
- Sensors
- Jie Yu + 4 more
To address the issue of insufficient resolution in remote sensing images due to limitations in sensors and transmission, this paper proposes a multi-scale feature fusion model, MSFANet, based on the Swin Transformer architecture for remote sensing image super-resolution reconstruction. The model comprises three main modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. The deep feature extraction module innovatively introduces three core components: Feature Refinement Augmentation (FRA), Local Structure Optimization (LSO), and Residual Fusion Network (RFN), which effectively extract and adaptively aggregate multi-scale information from local to global levels. Experiments conducted on three public remote sensing datasets (RSSCN7, AID, and WHU-RS19) demonstrate that MSFANet outperforms state-of-the-art models (including HSENet and TransENet) across five evaluation metrics in ×2, ×3, and ×4 super-resolution tasks. Furthermore, MSFANet achieves superior reconstruction quality with reduced computational overhead, striking an optimal balance between efficiency and performance. This positions MSFANet as an effective solution for remote sensing image super-resolution applications.
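Super-resolution comparisons such as the one above are typically scored with fidelity metrics like PSNR; a minimal reference implementation (standard definition, 8-bit peak assumed) is:

```python
import numpy as np

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB, a standard fidelity
    metric for super-resolution reconstruction."""
    mse = np.mean((ref.astype(float) - est.astype(float)) ** 2)
    if mse == 0:
        return np.inf          # identical images
    return 10 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 100.0)
est = ref + 10.0               # constant error of 10 -> MSE = 100
print(psnr(ref, est))          # 10*log10(65025/100) ~ 28.13 dB
```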