With the rapid advancement and application of Unmanned Aerial Vehicles (UAVs), target detection in urban scenes has made significant progress. Achieving precise 3D reconstruction from oblique imagery is essential for accurate urban object detection in UAV images. However, detection accuracy remains low because of subtle target features, complex backgrounds, and the prevalence of small targets. To address these issues, we introduce the Polysemantic Cooperative Detection Transformer (Pc-DETR), a novel end-to-end target detection network for UAV images. Our primary innovation, the Polysemantic Transformer (PoT) Backbone, enhances visual representation by using contextual information to guide a dynamic attention matrix. This matrix, formed through convolutions, captures both static and dynamic features, which improves detection performance. Additionally, we propose the Polysemantic Cooperative Mixed-Task Training scheme, which employs multiple auxiliary heads with diverse label assignments to boost the encoder’s learning capacity. This scheme customizes queries and improves training efficiency without increasing inference cost. Comparative experiments show that Pc-DETR achieves a 3% improvement in detection accuracy over the current state-of-the-art MFEFNet, setting a new benchmark in UAV image detection and advancing methodologies for intelligent UAV surveillance systems.