Variants of the DEtection TRansformer (DETR) have achieved impressive performance in general object detection. However, their performance degrades notably in crowded pedestrian detection. The degradation originates in the training phase, where DETRs are constrained solely by pedestrian labels. Under that single constraint, the models produce image features that do not separate pedestrians from visually similar background elements, which results in incorrect detections. To address this issue, this paper introduces a rank-based contrastive learning method that constructs an additional, specific constraint for each indistinguishable training sample so that the model produces distinguishable image features. Unlike previous methods, which rely solely on pedestrian labels to achieve consistent confidence scores, our approach relies on multiple constraints and aims to ensure the correct rank of detection results, with the confidence scores of pedestrians consistently surpassing those of background elements. Specifically, we first filter out training samples that could interfere with the delineation of indistinguishable and distinguishable training samples. Then, based on the confidence score rank, we divide the remaining training samples into four groups: distinguishable positives, distinguishable negatives, indistinguishable positives, and indistinguishable negatives. Finally, we combine these training samples into multiple positive and negative pairs and use the pairs to train DETRs via contrastive learning. Our method can be plugged into any DETR variant and adds no inference overhead. Extensive experiments on three DETRs show that our method achieves superior performance. In particular, on the CrowdHuman dataset, our method achieves a state-of-the-art MR of 38.9%.
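The rank-based grouping and pairing steps can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the filtering rule, the exact grouping criterion, or the contrastive objective. So the sketch omits filtering, assumes a simple rank rule for "indistinguishable" samples, and swaps in a margin-based pairwise ranking loss on confidence scores in place of the contrastive loss. The names `split_by_rank`, `rank_pair_loss`, and `margin` are hypothetical.

```python
from itertools import product


def split_by_rank(scores, labels):
    """Split samples into four groups using the confidence-score rank.

    Assumption (not from the paper): a positive is "indistinguishable" when at
    least one negative scores at or above it, and a negative is
    "indistinguishable" when at least one positive scores at or below it.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    max_neg = max(neg, default=float("-inf"))
    min_pos = min(pos, default=float("inf"))
    return {
        "dist_pos": [s for s in pos if s > max_neg],
        "indist_pos": [s for s in pos if s <= max_neg],
        "dist_neg": [s for s in neg if s < min_pos],
        "indist_neg": [s for s in neg if s >= min_pos],
    }


def rank_pair_loss(scores, labels, margin=0.2):
    """Mean hinge loss over (positive, negative) pairs that involve at least
    one indistinguishable sample. Stand-in for the contrastive objective."""
    g = split_by_rank(scores, labels)
    # Indistinguishable positives are paired with every negative.
    pairs = list(product(g["indist_pos"], g["dist_neg"] + g["indist_neg"]))
    # Distinguishable positives are paired with indistinguishable negatives.
    pairs += list(product(g["dist_pos"], g["indist_neg"]))
    if not pairs:
        return 0.0
    # Penalize a pair when the positive does not beat the negative by `margin`.
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)
```

For example, with scores `[0.9, 0.4, 0.6, 0.1]` and labels `[1, 1, 0, 0]`, the pedestrian scored 0.4 and the background sample scored 0.6 are out of rank order, so only pairs involving them contribute to the loss. A perfectly ranked batch yields a loss of zero.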