Abstract DETR-like detectors have shown remarkable performance in general object detection, but they face limitations on remote sensing images, which primarily contain small objects. Mainstream two-stage DETR-like models employ a pipeline that selects and processes a small portion of informative tokens, which enhances performance but makes the result highly dependent on token selection. Current static token selection strategies create inconsistencies between the static selection criteria and the dynamically updated tokens. Additionally, in remote sensing images, the limited information available for small objects and their inherent sensitivity to pixel shifts further degrade detection performance. To address these issues, we propose Scale-Adaptive Salience DETR (SAS DETR), a two-stage DETR-like method. SAS DETR incorporates dynamic token filtering, which uses a global threshold predictor to determine the token filtering ratio for each encoder layer. This approach selects an appropriate filtering ratio for each network layer while maintaining consistency between the foreground confidence map and token updates. Furthermore, we introduce a novel scale-adaptive salience supervision mechanism that scales the salience computation area according to object size, so that the model supervises small objects more effectively and makes better use of the information within tokens without compromising detection performance for objects of other sizes. Finally, we employ Scale-adaptive Intersection over Union to reduce the impact of pixel shifts on small objects. With these improvements, SAS DETR achieves 25.2% AP on the AI-TOD-V2 dataset with 24 training epochs and 50.4% AP on the COCO 2017 dataset with 12 training epochs.
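The idea of threshold-based token filtering can be illustrated with a minimal sketch. This is not the SAS DETR implementation: the function names `filter_tokens` and `keep_ratio`, the foreground scores, and the per-layer thresholds are all hypothetical, and the thresholds are hard-coded here whereas the abstract describes them as the output of a learned global threshold predictor. The sketch only shows how a single threshold per layer implies a different filtering ratio per layer.

```python
from typing import List, Sequence


def filter_tokens(scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices of tokens whose foreground score is at least `threshold`."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def keep_ratio(scores: Sequence[float], threshold: float) -> float:
    """Fraction of tokens that survive filtering at this threshold."""
    if not scores:
        return 0.0
    return len(filter_tokens(scores, threshold)) / len(scores)


# Hypothetical foreground confidences for 8 tokens (made-up values).
foreground_scores = [0.05, 0.92, 0.40, 0.10, 0.75, 0.33, 0.61, 0.02]

# Hypothetical per-layer thresholds; assumed fixed here for illustration,
# whereas a learned predictor would produce them in practice.
layer_thresholds = [0.1, 0.35, 0.6]

# Indices of the tokens kept at each encoder layer.
kept_per_layer = [filter_tokens(foreground_scores, t) for t in layer_thresholds]

# The implied filtering ratio for each layer.
ratios = [keep_ratio(foreground_scores, t) for t in layer_thresholds]
```

In this toy setup, a higher threshold at deeper layers keeps fewer tokens, so the filtering ratio follows from the threshold rather than being fixed in advance.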