RE-YOLO: a lightweight small object detection method for UAV remote sensing imagery
ABSTRACT To address the challenges of accurately extracting target features from complex scenes in UAV remote sensing imagery and the susceptibility of small objects to being obscured by noise, this paper proposes a lightweight detection algorithm, RE-YOLO, based on YOLOv8n. First, a multi-scale convolutional module named RFCSConv, which integrates channel and spatial attention mechanisms based on Receptive Field Attention Convolution (RFAConv), replaces the original convolution layers. This enhances feature selection and fusion at multiple scales. Second, the Efficient Squeeze-and-Excitation Module (ESEModule) is introduced into the backbone to strengthen feature representation while reducing computational overhead. Lastly, a composite loss function called Win-IoU, combining Wise-IoU (WIoU) and Inner-IoU, is proposed to dynamically adjust gradient contributions based on anchor quality. Experimental results on the VisDrone2019 dataset demonstrate that RE-YOLO achieves 29.7% mAP@0.5 with only 3.2MB of parameters and a real-time speed of 150 FPS. The algorithm also generalizes well across the HRSID and CARPK datasets, achieving 91.8% and 94.3% mAP@0.5 respectively.
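The Inner-IoU component of the proposed Win-IoU loss replaces the usual overlap term with the IoU of auxiliary boxes rescaled about each box's center. A minimal sketch of that idea, assuming (x1, y1, x2, y2) boxes and an illustrative shrink ratio of 0.8 (the function names and the ratio are assumptions, not the paper's code):

```python
def iou(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def scale_box(box, ratio):
    """Shrink (ratio < 1) or grow a box about its center."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    hw, hh = (box[2] - box[0]) * ratio / 2, (box[3] - box[1]) * ratio / 2
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def inner_iou(pred, gt, ratio=0.8):
    """IoU of the rescaled auxiliary boxes."""
    return iou(scale_box(pred, ratio), scale_box(gt, ratio))
```

Shrinking both boxes (ratio < 1) makes the auxiliary IoU stricter for loosely overlapping pairs, which is how the Inner-IoU formulation modulates convergence on samples of different anchor quality.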
- Research Article
- 10.1142/s0218001425550043
- May 1, 2025
- International Journal of Pattern Recognition and Artificial Intelligence
Dense small object detection in complex scenes is a valuable and challenging research field. While deep learning has driven significant advancements in computer vision, traditional object detection models still struggle to achieve high accuracy in detecting small objects, particularly in large-scale aerial images. Challenges such as scale variations, occlusions, and complex backgrounds continue to hinder the effective detection of dense small objects. In this paper, we present the Scale Transformer Small Object Detection Network (STSODNet), a novel architecture designed to address these challenges. First, we conceptualize the pronounced scale variation in drone images as an anomalous disturbance and propose a multiscale feature enhancement module (MSFEM), built upon the Spatial Transformer Network, to mitigate this effect. The multiscale feature enhancement module performs learnable, multi-point magnification on regions surrounding objects based on spatial saliency, enhancing the model’s scale invariance. Second, to generate a more accurate global saliency map and heighten the model’s focus on small target regions, we introduce a refined spatial attention mechanism, termed Spatial Region Attention. This mechanism combines coarse region attention with fine spatial attention to produce a more detailed saliency map and improve long-range dependency capture. Third, to achieve more accurate spatial regression of small objects, the traditional three-layer detection head is improved by expanding its output layer, resulting in a finer and larger output while maintaining the same number of parameters. Extensive experiments on the VisDrone and SeaPerson benchmark datasets validate that STSODNet achieves superior precision and robustness, outperforming current state-of-the-art object detection methods for small object detection.
- Research Article
- 10.1142/s0218001423500246
- Aug 1, 2023
- International Journal of Pattern Recognition and Artificial Intelligence
With the rapid development of computer vision and artificial intelligence, visual object detection has made unprecedented progress, and small object detection in complex scenes has attracted increasing attention. To address the problems of ambiguity, overlap, and occlusion in small object detection in complex scenes, this paper proposes a multi-scale fusion feature enhanced path aggregation network, MSFE-PANet. By adding an attention mechanism and feature fusion, it strengthens the fusion of the strong semantic information of deep feature maps with the strong localization information of shallow feature maps, helping the network find regions of interest in complex scenes and improving its sensitivity to small objects. A rejection loss function and network prediction scales are designed to address missed and false detections of overlapping and occluded small objects against complex backgrounds. The proposed method achieves an accuracy of 40.7% on the VisDrone2021 dataset and 89.7% on the PASCAL VOC dataset. Comparative analysis with mainstream object detection algorithms confirms the superiority of this method for detecting small objects in complex scenes.
- Research Article
- 10.3390/app122211854
- Nov 21, 2022
- Applied Sciences
To reduce missed and false detections of small objects in natural scenes, this paper proposes a small object detection algorithm with adaptive feature fusion, referred to as MMF-YOLO. First, to address the tendency of small-object pixels to be lost, a multi-branch cross-scale feature fusion module with fusion factors is proposed; each fusion path has an adaptive fusion factor that allows the network to independently adjust the importance of features according to learned weights. Then, to address the similarity of small objects to background information and the overlap of small objects in complex scenes, the M-CBAM attention mechanism is proposed and added to the feature reinforcement extraction module to reduce feature redundancy. Finally, in light of small object sizes and their large size span, the size of the object detection head is modified to fit small objects. Experiments on the VisDrone2019 dataset show that the mAP of the proposed algorithm reaches 42.23% with a parameter size of only 29.33 MB, which is 9.13% ± 0.07% higher than the benchmark network's mAP, while the network model is reduced by 5.22 MB.
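The adaptive fusion factors described above behave like learned, normalized per-branch weights. A minimal sketch under that reading (the non-negativity clipping, epsilon, and function name are assumptions for illustration, not the MMF-YOLO implementation):

```python
def fuse(branches, raw_weights, eps=1e-4):
    """Weighted sum of same-shaped feature vectors using learned scalar
    fusion factors, normalized so the weights sum to roughly one."""
    w = [max(0.0, r) for r in raw_weights]   # keep each factor non-negative
    s = sum(w) + eps                         # eps avoids division by zero
    n = [wi / s for wi in w]                 # normalized fusion factors
    return [sum(ni * b[j] for ni, b in zip(n, branches))
            for j in range(len(branches[0]))]
```

In training, `raw_weights` would be free parameters updated by gradient descent, letting the network decide how much each scale contributes at each fusion point.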
- Research Article
- 10.3390/electronics14112274
- Jun 2, 2025
- Electronics
Object detection algorithms have evolved from two-stage to single-stage architectures, with foundation models achieving sustained improvements in accuracy. However, in intelligent retail scenarios, small object detection and occlusion issues still lead to significant performance degradation. To address these challenges, this paper proposes an improved model based on YOLOv11, focusing on resolving insufficient multi-scale feature coupling and occlusion sensitivity. First, a multi-scale feature extraction network (MFENet) is designed. It splits input feature maps into dual branches along the channel dimension: the upper branch performs local detail extraction and global semantic enhancement through secondary partitioning, while the lower branch integrates CARAFE (content-aware reassembly of features) upsampling and SENet (squeeze-and-excitation network) channel weight matrices to achieve adaptive feature enhancement. The three feature streams are fused to output multi-scale feature maps, significantly improving small object detail retention. Second, a convolutional block attention module (CBAM) is introduced during feature fusion, dynamically focusing on critical regions through channel–spatial dual attention mechanisms. A fuseModule is designed to aggregate multi-level features, enhancing contextual modeling for occluded objects. Additionally, the extreme-IoU (XIoU) loss function replaces the traditional complete-IoU (CIoU), combined with XIoU-NMS (extreme-IoU non-maximum suppression) to suppress redundant detections, optimizing convergence speed and localization accuracy. Experiments demonstrate that the improved model achieves a mean average precision (mAP50) of 0.997 (0.2% improvement) and mAP50-95 of 0.895 (3.5% improvement) on the RPC product dataset and the 6th Product Recognition Challenge dataset. The recall rate increases to 0.996 (0.6% improvement over baseline). 
Although frames per second (FPS) decreased compared to the original model, the improved model still meets real-time requirements for retail scenarios. The model exhibits stable noise resistance in challenging environments and achieves 84% mAP in cross-dataset testing, validating its generalization capability and engineering applicability. Video streams were captured using a Zhongweiaoke camera operating at 60 fps, satisfying real-time detection requirements for intelligent retail applications.
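The CBAM used during feature fusion gates features along channels and then along spatial positions. The NumPy sketch below illustrates the mechanism only, not this paper's implementation: the shared-MLP weights are passed in explicitly, and the spatial branch mixes the pooled maps per pixel where real CBAM applies a 7×7 convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). A shared two-layer MLP scores avg- and max-pooled
    channel descriptors; the summed scores gate each channel."""
    avg = x.mean(axis=(1, 2))                              # (C,)
    mx = x.max(axis=(1, 2))                                # (C,)
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                  + w2 @ np.maximum(w1 @ mx, 0.0))         # (C,)
    return x * att[:, None, None]

def spatial_attention(x, k):
    """Channel-wise avg and max maps are mixed (here per pixel, with two
    scalar weights k) and squashed into a spatial gate."""
    avg = x.mean(axis=0)                                   # (H, W)
    mx = x.max(axis=0)                                     # (H, W)
    att = sigmoid(k[0] * avg + k[1] * mx)                  # (H, W)
    return x * att[None, :, :]
```

Applied in sequence (`spatial_attention(channel_attention(x, …), …)`), this reproduces CBAM's channel-then-spatial ordering; both gates lie in (0, 1), so features are only re-weighted, never amplified in sign.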
- Research Article
- 10.3390/drones9090610
- Aug 29, 2025
- Drones
In recent years, detection methods for generic object detection have achieved significant progress. However, due to the large number of small objects in aerial images, mainstream detectors struggle to achieve a satisfactory detection performance. The challenges of small object detection in aerial images are primarily twofold: (1) Insufficient feature representation: The limited visual information for small objects makes it difficult for models to learn discriminative feature representations. (2) Background confusion: Abundant background information introduces more noise and interference, causing the features of small objects to easily be confused with the background. To address these issues, we propose a Multi-Level Contextual and Semantic Information Aggregation Network (MCSA-Net). MCSA-Net includes three key components: a Spatial-Aware Feature Selection Module (SAFM), a Multi-Level Joint Feature Pyramid Network (MJFPN), and an Attention-Enhanced Head (AEHead). The SAFM employs a sequence of dilated convolutions to extract multi-scale local context features and combines a spatial selection mechanism to adaptively merge these features, thereby obtaining the critical local context required for the objects, which enriches the feature representation of small objects. The MJFPN introduces multi-level connections and weighted fusion to fully leverage the spatial detail features of small objects in feature fusion and enhances the fused features further through a feature aggregation network. Finally, the AEHead is constructed by incorporating a sparse attention mechanism into the detection head. The sparse attention mechanism efficiently models long-range dependencies by computing the attention between the most relevant regions in the image while suppressing background interference, thereby enhancing the model’s ability to perceive targets and effectively improving the detection performance. 
Extensive experiments on four datasets, VisDrone, UAVDT, MS COCO, and DOTA, demonstrate that the proposed MCSA-Net achieves an excellent detection performance, particularly in small object detection, surpassing several state-of-the-art methods.
- Conference Article
- 10.23919/ccc55666.2022.9902202
- Jul 25, 2022
In deep learning, object detection has achieved very large performance improvements. However, because few features are available for small objects, and due to network structure, sample imbalance, and other factors, results on small object detection remain unsatisfactory. To solve this problem, this paper proposes a method combining multi-scale feature fusion and dilated convolution: dilated convolution expands the receptive field of feature maps at different scales, and high-level and low-level semantic information is then extracted from the backbone network. The feature maps with different receptive fields are fused to obtain the final feature map prediction. In addition, a series of channel attention and spatial attention mechanisms is added to the network to better capture the contextual information of objects in the image. Experiments show that this method achieves higher accuracy than the traditional YOLOv3 network in detecting small objects. For 640×640 images, it achieves 31.5% accuracy on small objects in MS COCO2017, a 4-point improvement over YOLOv5.
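The receptive-field growth that dilation buys can be checked with the standard recurrence r_out = r_in + (k − 1)·d·j, where k is the kernel size, d the dilation, and j the cumulative stride. A small sketch (the layer configurations in the test are illustrative, not the paper's network):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, each given as
    (kernel_size, stride, dilation), starting from a single pixel."""
    r, j = 1, 1                      # receptive field, cumulative stride
    for k, s, d in layers:
        r += (k - 1) * d * j         # dilation widens the effective kernel
        j *= s                       # later layers step over j input pixels
    return r
```

A single 3×3 convolution with dilation 2 covers the same 5×5 input window as two stacked plain 3×3 convolutions, which is why dilated convolutions can enlarge context cheaply.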
- Research Article
- 10.1038/s41598-025-85961-9
- Jan 18, 2025
- Scientific Reports
In the underwater domain, small object detection plays a crucial role in the protection, management, and monitoring of the environment and marine life. Advancements in deep learning have led to the development of many efficient detection techniques. However, the complexity of the underwater environment, limited information available from small objects, and constrained computational resources make small object detection challenging. To tackle these challenges, this paper presents an efficient deep convolutional network model. First, a CSP for small object and lightweight (CSPSL) module is introduced to enhance feature retention and preserve essential details. Next, a variable kernel convolution (VKConv) is proposed to dynamically adjust the convolution kernel size, enabling better multi-scale feature extraction. Finally, a spatial pyramid pooling for multi-scale (SPPFMS) method is presented to preserve the features of small objects more effectively. Ablation experiments on the UDD dataset demonstrate the effectiveness of the proposed methods. Comparative experiments on the UDD and DUO datasets demonstrate that the proposed model delivers the best performance in terms of computational cost and detection accuracy, outperforming state-of-the-art methods in real-time underwater small object detection tasks.
- Research Article
- 10.1109/tim.2022.3196319
- Jan 1, 2022
- IEEE Transactions on Instrumentation and Measurement
Unmanned aerial vehicles (UAVs) have been widely used in post-disaster search and rescue operations, object tracking, and other tasks. Therefore, the autonomous perception of UAVs based on computer vision has become a research hotspot in recent years. However, UAV images include dense objects, small objects, and arbitrary object directions, which bring significant challenges to existing object detection methods. To alleviate these issues, we propose a global-local feature enhanced network (GLF-Net). Considering the difficulty of processing UAV images with complex scenes and dense objects, we designed a backbone based on involution and self-attention that can extract effective features from complex objects. A multiscale feature fusion module is also proposed to address the presence of numerous small objects in UAV images through multiscale object detection and feature fusion. To accurately detect rotated objects, a rotated regional proposal network was designed based on the midpoint offset representation, which can apply a rotated box to determine the real direction and contour of an object. GLF-Net achieves a state-of-the-art detection accuracy (86.52% mAP) on our created RO-UAV dataset, while achieving 96.95% and 97% mAP on the public datasets HRSC2016 and UCAS-AOD, respectively. The experimental results demonstrate that our method achieves high detection accuracy and generalization, which can meet the practical requirements of UAVs under various complex scenarios.
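The midpoint offset representation used by the rotated proposal network can be sketched as follows: a horizontal outer box (x, y, w, h) plus offsets of two edge midpoints defines a parallelogram inscribed in that box, which approximates the rotated object. The exact parameterization below follows the common scheme in oriented detectors and is an illustrative assumption, not GLF-Net's code:

```python
def midpoint_offset_to_vertices(x, y, w, h, da, db):
    """Vertices of the parallelogram defined by a horizontal box centred
    at (x, y) with size (w, h), whose top/right edge midpoints are shifted
    by da (along x) and db (along y); the other two follow by symmetry."""
    v1 = (x + da, y - h / 2)     # top edge midpoint, shifted along x
    v2 = (x + w / 2, y + db)     # right edge midpoint, shifted along y
    v3 = (x - da, y + h / 2)     # bottom vertex, symmetric to v1
    v4 = (x - w / 2, y - db)     # left vertex, symmetric to v2
    return [v1, v2, v3, v4]
```

With zero offsets the vertices sit on the edge midpoints (an axis-aligned diamond), and by construction the parallelogram's centroid stays at the outer box centre.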
- Research Article
- 10.3390/rs17142421
- Jul 12, 2025
- Remote Sensing
Complex scenes and densely distributed small objects frequently lead to serious false and missed detections in small object detection for unmanned aerial vehicle (UAV) images. Consequently, we propose a UAV image small object detection algorithm, termed SMA-YOLO. Firstly, a parameter-free simple slicing convolution (SSC) module is integrated into the backbone network to slice and enhance the feature maps, effectively retaining the features of small objects. Subsequently, to enhance the information exchange between upper and lower layers, we design a multi-cross-scale feature pyramid network (M-FPN); its C2f-Hierarchical-Phantom Convolution (C2f-HPC) module effectively reduces information loss through fine-grained multi-scale feature fusion. Ultimately, the adaptive spatial feature fusion detection head (ASFFDHead) introduces an additional P2 detection head to raise the resolution of feature maps and better locate small objects, while the ASFF mechanism filters out information conflicts during multi-scale feature fusion, significantly improving small object detection capability. Using YOLOv8n as the baseline, SMA-YOLO is evaluated on the VisDrone2019 dataset, achieving a 7.4% improvement in mAP@0.5 with a 13.3% reduction in model parameters; its generalization ability is also verified on the UAVDT and RSOD datasets, demonstrating the effectiveness of our approach.
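The slicing step in modules like SSC rearranges spatial pixels into extra channels, so spatial resolution is halved without discarding any values. A minimal NumPy sketch of that space-to-depth rearrangement (the function name and the block size of 2 are illustrative assumptions):

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange a (C, H, W) map into (C*block*block, H//block, W//block).
    Every input value survives, so downsampling is lossless."""
    c, h, w = x.shape
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)   # move the intra-block offsets forward
    return x.reshape(c * block * block, h // block, w // block)
```

Each output channel is one of the four interleaved sub-grids of the input (e.g. every pixel at even row and even column), which is why such modules preserve small-object detail better than strided convolution or pooling.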
- Research Article
- 10.3390/electronics12224621
- Nov 12, 2023
- Electronics
This study addresses the challenges that conventional network models face in detecting small foreign objects on industrial production lines, exemplified by scenarios where a single piece of iron filing occupies approximately 0.002% of the image area. To tackle this, we introduce an enhanced YOLOv8-MeY model for detecting foreign objects on the surface of sugar bags. Firstly, the introduction of a 160 × 160-scale small object detection layer and integration of the Global Attention Mechanism (GAM) attention module into the feature fusion network (Neck) increased the network’s focus on small objects. This enhancement improved the network’s feature extraction and fusion capabilities, which ultimately increased the accuracy of small object detection. Secondly, the model employs the lightweight network GhostNet, replacing YOLOv8’s principal feature extraction network, DarkNet53. This adaptation not only diminishes the quantity of network parameters but also augments feature extraction capabilities. Furthermore, we substituted the Bottleneck in the C2f of the YOLOv8 model with the Spatial and Channel Reconstruction Convolution (SCConv) module, which, by mitigating the spatial and channel redundancy inherent in standard convolutions, reduced computational demands while elevating the performance of the convolutional network model. The model has been effectively applied to the automated sugar dispensing process in food factories, exhibiting exemplary performance. In detecting diverse foreign objects like 2 mm iron filings, 7 mm wires, staples, and cockroaches, the YOLOv8-MeY model surpasses the Faster R-CNN model and the contemporaneous YoloV8n model of equivalent parameter scale across six metrics: precision, recall, mAP@0.5, parameters, GFLOPs, and model size. Through 400 manual placement tests involving four types of foreign objects, our statistical results reveal that the model achieves a recognition rate of up to 92.25%. 
Ultimately, we have successfully deployed this model in automated sugar bag dispensing scenarios.
- Research Article
- 10.3389/fmars.2025.1542832
- Apr 15, 2025
- Frontiers in Marine Science
Although side-scan sonar can provide wide, high-resolution views of submarine terrain and objects, it suffers from severe interference due to complex environmental noise, variations in sonar configuration (such as frequency and beam pattern), and the small scale of targets, leading to a high misdetection rate. These challenges highlight the need for advanced detection models that can effectively address these limitations. This paper introduces an enhanced YOLOv9 (You Only Look Once v9) model named SOCA-YOLO, which integrates a small-object-focused convolution module and an attention mechanism to improve detection performance. The SOCA-YOLO framework first constructs a high-resolution SSS (side-scan sonar image) enhancement pipeline through image restoration techniques to extract fine-grained features of micro-scale targets. Subsequently, the SPDConv (Space-to-Depth Convolution) module is incorporated to optimize the feature extraction network, effectively preserving the discriminative characteristics of small targets. Furthermore, the model integrates the standard CBAM (Convolutional Block Attention Module) attention mechanism, enabling adaptive focus on salient regions of small targets in sonar images and significantly improving detection robustness in complex underwater environments. Finally, the model is verified on Cylinder2, a public side-scan sonar image dataset. Experimental results indicate that SOCA-YOLO achieves Precision and Recall of 71.8% and 72.7%, with an mAP50 of 74.3%. It outperforms the current state-of-the-art object detection method, YOLO11, as well as the original YOLOv9; specifically, our model surpasses YOLO11 and YOLOv9 by 2.3% and 6.5% in mAP50, respectively. The SOCA-YOLO model thus provides a new and effective approach for small underwater object detection in side-scan sonar images.
- Conference Article
- 10.1117/12.2194606
- Oct 15, 2015
Small object detection in the vast ocean plays an important role in rescue after an accident or disaster. One promising approach is a hyperspectral imaging system (HIS). However, owing to the limited resolution of HIS sensors, a target of interest may occupy only a few pixels or less in the image, making small objects difficult to detect; sun glint on the sea surface makes it even more difficult. In this paper, we propose an image analysis technique suitable for the computer-aided detection of small objects on the sea surface, especially humans. We first separate objects from the background by adapting a previously proposed image enhancement method and then apply a linear unmixing method to define the endmembers' spectra. Finally, we use the spectral angle mapping method to classify the presented objects and thus detect small objects. The proposed system provides the following outputs to support the detection of humans and other small objects on the sea surface: an image with spectral color enhancement, alerts for various objects, and human detection results. This multilayered approach is expected to reduce oversights, i.e., false negative errors. Results of the proposed technique have been compared with existing methods; our method successfully enhances the hyperspectral image and detects small objects on the sea surface with a high human detection rate, and the results are less influenced by sun glint effects. This study helps in recognizing small objects on the sea surface and leads to advances in rescue systems using aircraft equipped with HIS technology.
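Spectral angle mapping classifies a pixel by the angle between its spectrum and each endmember's spectrum; because the angle ignores vector magnitude, it is insensitive to uniform brightness scaling such as glint-induced intensity changes. A minimal sketch (the endmember labels in the test are illustrative):

```python
import math

def spectral_angle(a, b):
    """Angle in radians between two spectra; 0 means identical shape."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # clamp guards against tiny floating-point overshoot outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def classify(pixel, endmembers):
    """Label of the endmember spectrum with the smallest angle to pixel."""
    return min(endmembers, key=lambda name: spectral_angle(pixel, endmembers[name]))
```

Scaling a spectrum by any positive constant leaves its angle to every endmember unchanged, which is the property that makes this classifier robust to sea-surface illumination variation.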
- Research Article
- 10.1088/1742-6596/2450/1/012088
- Mar 1, 2023
- Journal of Physics: Conference Series
The Single Shot MultiBox Detector (SSD) is a well-known object detection method, but it is not effective at detecting small objects. This paper modifies the SSD to address the insufficient semantic information in its low-level feature maps and thereby improve small object detectability. First, the Feature Pyramid Network (FPN) is incorporated into the SSD so that the shallow feature map, which is primarily used for detecting small objects, carries more semantic information in addition to rich location information. Second, the Convolutional Block Attention Module (CBAM) is introduced to reinforce the SSD network's capability to learn key features and thus reduce missed detections. The experimental data indicate that this algorithm achieves 78.1% mAP on the PASCAL VOC2007 test set, a 3.9% improvement over the conventional SSD, and also improves considerably on Fast R-CNN and Faster R-CNN. In addition, this algorithm is better suited to small object detection and meets real-time requirements.
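The FPN merge that enriches SSD's shallow maps is, at its core, an upsample-and-add of a deeper, more semantic map into a shallower, higher-resolution one. A minimal NumPy sketch with nearest-neighbour upsampling (real FPNs also apply 1×1 lateral and 3×3 smoothing convolutions, omitted here as a simplifying assumption):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (H, W) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def top_down_merge(shallow, deep):
    """FPN-style merge: the shallow lateral map plus the upsampled
    deeper map, giving the shallow level access to deep semantics."""
    return shallow + upsample2x(deep)
```

Chaining this from the deepest level downward produces the top-down pathway: each shallow detection layer then sees both its own fine localization detail and the semantics propagated from above.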
- Research Article
- 10.3390/app16041673
- Feb 7, 2026
- Applied Sciences
Existing DEtection TRansformer-based (DETR) object detection methods have been widely applied to standard object detection tasks, but still face numerous challenges in detecting small objects. These methods frequently miss the fine details of small objects and fail to preserve global context, particularly under scale variation or occlusion. The resulting feature maps lack sufficient spatial and structural information. Moreover, some DETR-based models specifically designed for small object detection often have poor generalization capabilities and are difficult to adapt to datasets with diverse object scales and complex backgrounds. To address these issues, this paper proposes a novel object detection model—small object detection with efficient multi-scale collaborative attention and depth feature fusion based on DETR (ED-DETR)—which consists of three core modules: an efficient multi-scale collaborative attention mechanism (EMCA), DepthPro, a zero-shot metric monocular depth estimation model, and an adaptive feature fusion module for depth maps and feature maps. Specifically, EMCA extends the single-space attention mechanism in efficient multi-scale attention (EMA) to a composite structure of parallel spatial and channel attention, enhancing ED-DETR’s ability to express features collaboratively in both spatial and channel dimensions. DepthPro generates depth maps to extract depth information. The adaptive feature fusion module integrates depth information with RGB visual features, improving ED-DETR’s ability to perceive object position, scale, and occlusion. The experimental results show that ED-DETR achieves the current best 33.6% mAP on the AI-TOD-V2 dataset, which predominantly contains tiny objects, outperforming previous CNN-based and DETR-based methods, and shows excellent generalization performance on the VisDrone and COCO datasets.
- Research Article
- 10.3390/s22124339
- Jun 8, 2022
- Sensors (Basel, Switzerland)
One common issue in object detection in aerial imagery is the small size of objects in proportion to the overall image size, mainly caused by the high camera altitudes and wide-angle lenses commonly used on drones to maximize coverage. State-of-the-art general-purpose object detectors tend to under-perform on small objects due to the loss of spatial features, weak feature representation of small objects, and the sheer imbalance between objects and background. This paper addresses small object detection in aerial imagery with a Convolutional Neural Network (CNN) model that uses the Single Shot multi-box Detector (SSD) as the baseline network and extends its small object detection performance with feature enhancement modules including super-resolution, deconvolution, and feature fusion. These modules collectively improve the feature representation of small objects at the prediction layer. The performance of the proposed model is evaluated on three datasets, including two aerial image datasets that consist mainly of small objects, and compared with state-of-the-art small object detectors. Experimental results demonstrate improvements in mean Average Precision (mAP) and Recall over the state-of-the-art small object detectors investigated in this study.