Traditional object detectors require extensive instance-level annotations for training. Few-shot object detectors, by contrast, are generally fine-tuned on limited data from novel classes; they tend to be biased toward base categories and are sensitive to variation among the few novel samples. To mitigate these problems, we introduce a Two-Stage Fine-Tuning Approach (TFA) named Extreme R-CNN, which is designed to work with extremely limited original samples by combining sample synthesis and knowledge distillation. We synthesize new training examples by clipping annotated instances and applying various data-augmentation techniques. We also modify the Faster R-CNN architecture by decoupling the regression and classification components of the Region of Interest (RoI) head, so that synthetic samples can train the classification head independently of object localization. Comprehensive evaluations on the Microsoft COCO and PASCAL VOC datasets show significant improvements over baseline methods. On PASCAL VOC, the average precision for novel categories improves by up to 15 percent; on the more complex Microsoft COCO benchmark, it improves by up to 6.1 percent. Remarkably, on PASCAL VOC the AP50 of our model in the 1-shot scenario exceeds that of the baseline model in the 10-shot setting, confirming the efficacy of the proposed method.
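The sample-synthesis step described above can be illustrated with a minimal sketch. It assumes that instance clipping means cropping each annotated bounding box out of its image, and it uses a horizontal flip and a brightness jitter as stand-ins for the unspecified augmentations. The function names `clip_instance` and `synthesize_samples`, the `(x1, y1, x2, y2)` box convention, and the jitter range are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def clip_instance(image, box):
    """Crop one annotated instance from an H x W x C image.

    `box` is assumed to be (x1, y1, x2, y2) in pixel coordinates.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = (int(v) for v in box)
    # Clamp the box to the image bounds before slicing.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, x2), min(h, y2)
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box has no area inside the image")
    return image[y1:y2, x1:x2].copy()


def synthesize_samples(image, box, rng):
    """Return a few synthetic crops of a single annotated instance.

    The augmentations are illustrative assumptions: the unchanged crop,
    a horizontal flip, and a random brightness scaling.
    """
    crop = clip_instance(image, box)
    flipped = crop[:, ::-1].copy()
    scale = rng.uniform(0.8, 1.2)  # hypothetical jitter range
    jittered = np.clip(crop.astype(np.float32) * scale, 0, 255).astype(np.uint8)
    return [crop, flipped, jittered]


# Toy usage: a random 64 x 64 RGB image with one annotated box.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
samples = synthesize_samples(image, (10, 20, 40, 50), rng)
```

Because such crops contain only the object, they carry a class label but no useful localization target, which is consistent with the abstract's point that synthetic samples train the classification head while box regression is learned separately.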