Spatiality-Aware Prompt Tuning for Few-Shot Small Object Detection
Small Object Detection (SOD) is challenging because small image regions yield scarce image features. The niche nature of small objects also makes data collection harder than for normal-sized objects, so efficient learning from limited data is beneficial for SOD. To tackle few-shot SOD, we propose Spatiality-Aware Prompt Tuning (SAPT), a novel prompt tuning method for vision-language models (VLMs) that addresses both the scarcity of image features and the limited data available for small objects. SAPT appends a verbalizer prompt, a template-based sentence expressing the spatiality of small objects, to the text prompt of the pre-trained VLM. During fine-tuning, the integrated text prompt is learned solely through the decoder of the vision-language detector, while the image and text backbones of the model remain frozen to facilitate efficient learning. In our experiments, we demonstrate the effectiveness of the proposed method on the SODA-D and COCO datasets in both few-shot and full-shot learning scenarios, and show that our method improves on the state of the art in both settings.
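The two ingredients described above, a template-based verbalizer sentence appended to the class prompt and fine-tuning in which only the detector decoder is updated while both backbones stay frozen, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the template wording, class names, and module structure are assumptions.

```python
import torch
from torch import nn

def build_prompt(class_name: str) -> str:
    # Hypothetical template-based verbalizer sentence expressing
    # small-object spatiality (the exact wording is an assumption).
    return f"a photo of a {class_name}, a small object occupying a tiny image region"

class SAPTDetector(nn.Module):
    """Sketch of a vision-language detector set up for SAPT-style tuning."""

    def __init__(self, image_backbone: nn.Module, text_backbone: nn.Module,
                 decoder: nn.Module):
        super().__init__()
        self.image_backbone = image_backbone
        self.text_backbone = text_backbone
        self.decoder = decoder
        # Freeze both pre-trained backbones for parameter-efficient tuning.
        for p in self.image_backbone.parameters():
            p.requires_grad = False
        for p in self.text_backbone.parameters():
            p.requires_grad = False

    def trainable_parameters(self):
        # Only decoder parameters receive gradients during few-shot fine-tuning.
        return [p for p in self.decoder.parameters() if p.requires_grad]

# Example: only the decoder's parameters would be passed to the optimizer.
model = SAPTDetector(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))
optimizer = torch.optim.AdamW(model.trainable_parameters(), lr=1e-4)
```

In this setup the optimizer sees only the decoder's parameters, which matches the paper's description of learning the integrated text prompt solely through the decoder.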