Recent advances in pre-trained vision-language models have boosted the performance of unsupervised image representation learning across many vision tasks. Most existing works focus on learning global visual features with Transformers and neglect detailed local cues, leading to suboptimal performance on fine-grained vision tasks. In this article, we propose a text-guided patch token exploitation framework that enhances the discriminative power of unsupervised representations by exploiting more detailed local features. Our text-guided decoder extracts local features under the guidance of texts or learned prompts that describe discriminative object parts. We further introduce a local-global relation distillation loss to promote the joint optimization of local and global features. The proposed method can flexibly extract either global or combined global-local features as the image representation. It significantly outperforms previous methods on fine-grained image retrieval and base-to-new fine-grained classification tasks. For instance, our Recall@1 surpasses the recent unsupervised retrieval method STML by 6.0% on the SOP dataset. The code is publicly available at https://github.com/maosnhehe/TPTE.
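To illustrate the idea of a local-global relation distillation loss, the sketch below aligns batch-wise similarity relations of local (patch-derived) embeddings with those of global embeddings. This is a minimal PyTorch sketch under assumed names and an assumed KL-based formulation; it is not the paper's exact loss.

```python
# Illustrative sketch of a local-global relation distillation loss.
# Assumes `local_emb` and `global_emb` are (batch, dim) embeddings;
# the exact formulation in the paper may differ.
import torch
import torch.nn.functional as F


def relation_distillation_loss(local_emb: torch.Tensor,
                               global_emb: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Distill the global branch's batch-wise relation structure into the
    local branch (hypothetical formulation)."""
    local_emb = F.normalize(local_emb, dim=-1)
    global_emb = F.normalize(global_emb, dim=-1)

    # Pairwise cosine-similarity "relation" matrices within the batch.
    local_rel = local_emb @ local_emb.t() / temperature
    global_rel = global_emb @ global_emb.t() / temperature

    # KL divergence between row-wise relation distributions; the global
    # relations serve as the (detached) teacher signal.
    log_p_local = F.log_softmax(local_rel, dim=-1)
    p_global = F.softmax(global_rel, dim=-1).detach()
    return F.kl_div(log_p_local, p_global, reduction="batchmean")


if __name__ == "__main__":
    # Toy usage with random embeddings.
    local = torch.randn(8, 512)
    global_ = torch.randn(8, 512)
    print(relation_distillation_loss(local, global_).item())
```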