For drone-based detection tasks, accurately identifying small-scale targets like people, bicycles, and pedestrians remains a key challenge. In this paper, we propose DV-DETR, an improved detection model based on the Real-Time Detection Transformer (RT-DETR), specifically optimized for small target detection in high-density scenes. To achieve this, we introduce three main enhancements: (1) ResNet18 as the backbone network to improve feature extraction and reduce model complexity; (2) the integration of recalibration attention units and deformable attention mechanisms in the neck network to enhance multi-scale feature fusion and improve localization accuracy; and (3) the use of the Focaler-IoU loss function to better handle the imbalanced distribution of target scales and focus on challenging samples. Experimental results on the VisDrone2019 dataset show that DV-DETR achieves an mAP@0.5 of 50.1%, a 1.7% improvement over the baseline model, while increasing detection speed from 75 FPS to 90 FPS, meeting real-time processing requirements. These improvements not only enhance the model's accuracy and efficiency but also provide practical significance in complex, high-density urban environments, supporting real-world applications in UAV-based surveillance and monitoring tasks.
Read full abstract7-days of FREE Audio papers, translation & more with Prime
7-days of FREE Prime access