Interactive Teaching For Fine-Granular Few-Shot Object Recognition Using Vision Transformers
In real-world few-shot image classification tasks, the scarcity of data makes training and testing very challenging. The classification model must learn the most meaningful features from only a few sample images, without contextual knowledge. Here, interpretability methods for deep models help to increase comprehensibility and enable verification. However, these advantages are limited without the ability to correct the model directly. Therefore, we propose an interpretable approach to few-shot object recognition that includes optional interactive teaching to close the feedback loop. We leverage pretrained vision transformers as backbones, and our part-based inference particularly favors interpretability. We use a visual concept bank to translate semantic visual features between the human and the model. Even without any human interaction, our model performs competitively with state-of-the-art methods on few-shot image classification tasks. Beyond that, we demonstrate the benefits of our interactive interfaces. We show how they can significantly improve robustness in fine-grained recognition tasks and help to quickly adapt the model without complex fine-tuning.
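To make the few-shot setting concrete, the following is a minimal sketch of metric-based few-shot classification over backbone embeddings. It is not the paper's part-based method: the synthetic vectors below merely stand in for embeddings that a pretrained vision transformer would produce, and the nearest-prototype rule is a common baseline assumed for illustration.

```python
import numpy as np

# Hypothetical stand-in for ViT embeddings: in practice these vectors would
# come from a pretrained vision transformer backbone; here they are synthetic.
rng = np.random.default_rng(0)
dim = 16

# 5-way 1-shot episode: one support embedding (prototype) per class.
prototypes = rng.normal(size=(5, dim))

# A query embedding close to class 2's prototype (small additive noise).
query = prototypes[2] + 0.05 * rng.normal(size=dim)

def normalize(x):
    """L2-normalize along the last axis for cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the query and each class prototype.
sims = normalize(prototypes) @ normalize(query)
pred = int(np.argmax(sims))
print(pred)  # → 2 (the query is assigned to the nearest prototype)
```

With more shots per class, prototypes are typically the mean of each class's support embeddings; the classification rule stays the same.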