Abstract

Few-shot object detection (FSOD) has received considerable attention because labeling objects is difficult and time-consuming. Recent studies achieve excellent performance on natural scenes by using only a few instances of novel classes to fine-tune the last prediction layer of a model well-trained on plentiful base data. However, unlike natural-scene objects, whose orientation and size vary little, objects in remote sensing images (RSIs) vary greatly in both orientation and size. Methods designed for natural scenes therefore cannot be applied directly to RSIs. In this paper, we first propose a strong baseline for RSIs that fine-tunes all detector components acting on high-level features and effectively improves performance on novel classes. Analyzing the baseline's results further, we find that errors on novel classes are concentrated mainly in classification: the detector misclassifies novel classes as confusable base classes or as background, because generalized information is difficult to extract from limited instances. Text-modal knowledge, however, can concisely summarize the generalized and distinctive characteristics of categories. We therefore introduce a text-modal description for each category and propose an FSOD method guided by TExt-MOdal knowledge, called TEMO. Specifically, a text-modal knowledge extractor extracts text features, and a cross-modal assembly module fuses them into the visual-modal features. The fused features greatly reduce classification confusion for novel classes. Furthermore, we introduce a mask strategy and a separation loss to avoid over-fitting and to reduce ambiguity among text-modal features. Experimental results on DIOR, NWPU, and FAIR1M show that TEMO achieves state-of-the-art performance in all settings.
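The abstract does not detail the fusion mechanism or the loss formulation. The sketch below illustrates one plausible reading in PyTorch, assuming a cross-attention-style fusion of per-class text embeddings into visual RoI features and a margin-based cosine separation loss. All names (`CrossModalAssembly`, `separation_loss`), dimensions, and design choices here are hypothetical, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAssembly(nn.Module):
    """Hypothetical sketch of a cross-modal assembly module: fuse per-class
    text embeddings into visual RoI features via cross-attention.
    The actual TEMO module may differ; dimensions are illustrative."""

    def __init__(self, vis_dim: int = 1024, text_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, vis_dim)  # align text dim to visual dim
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (N, vis_dim) RoI features; text_feats: (C, text_dim), one per class.
        text = self.text_proj(text_feats).unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        query = vis_feats.unsqueeze(1)                  # (N, 1, vis_dim)
        fused, _ = self.attn(query, text, text)         # each RoI attends over class texts
        return self.norm(vis_feats + fused.squeeze(1))  # residual fusion


def separation_loss(text_embeds: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Hypothetical separation loss: penalize cosine similarity above a margin
    between different classes' text embeddings, so that descriptions of
    distinct categories stay distinguishable."""
    normed = F.normalize(text_embeds, dim=-1)
    sim = normed @ normed.t()  # (C, C) pairwise cosine similarities
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return torch.clamp(sim[off_diag] - margin, min=0).mean()
```

Under these assumptions, the fused features would feed the detector's classification head during fine-tuning, with `separation_loss` added to the detection objective; the margin value and the attention-based fusion are guesses, not details confirmed by the abstract.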
