Abstract

Research on vision-language pre-trained models such as CLIP has focused on fine-tuning methods that enhance generalization to downstream tasks. Recent work proposes fine-tuning these models with prompting and adapter techniques. However, prompting methods tend to overfit class-specific data distributions, and most adapters ignore prototype representations when constructing their weights, resulting in poor generalization. To tackle these challenges, we propose a novel method that integrates cluster prototype Earth Mover’s Distance (EMD) adapters with alignment-guided prompt learning (CPAAP) for vision-language models. Specifically, our adapter comprises three components: graph convolutional network-based clustering, hierarchical prototype representations, and EMD similarity, which together provide a robust metric for comparing feature representations. Additionally, alignment-guided prompt learning enforces a constraint between the predictions of the trainable and frozen models, enabling effective adaptation to downstream tasks. Extensive experiments are conducted on few-shot classification, base-to-novel generalization, cross-dataset evaluation, and domain generalization. In practice, CPAAP outperforms state-of-the-art methods on zero-shot tasks, achieving an absolute gain of 0.46% on the harmonic mean across 11 popular datasets.
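
The abstract mentions two technical ingredients: an EMD similarity between prototype and query features, and an alignment constraint between the trainable (prompted) model and the frozen pre-trained model. Since the full text is not available here, the PyTorch sketch below is purely illustrative rather than the paper's actual implementation: it assumes an entropic-regularized Sinkhorn solver for the EMD term and a KL-divergence consistency loss for the alignment term, and all function names (`sinkhorn`, `emd_similarity`, `alignment_loss`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, r, c, eps=0.05, n_iters=50):
    """Approximate an optimal transport plan with Sinkhorn iterations.
    cost: (n, m) cost matrix; r: (n,) source weights; c: (m,) target weights."""
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    u, v = torch.ones_like(r), torch.ones_like(c)
    for _ in range(n_iters):                       # alternate marginal scalings
        u = r / (K @ v + 1e-9)
        v = c / (K.t() @ u + 1e-9)
    return u.unsqueeze(1) * K * v.unsqueeze(0)     # transport plan T

def emd_similarity(query_feats, prototype_feats):
    """EMD-style similarity between a set of L2-normalized query features
    and a set of class prototypes; higher means more similar."""
    cost = 1.0 - query_feats @ prototype_feats.t()  # cosine cost matrix
    n, m = cost.shape
    r = cost.new_full((n,), 1.0 / n)                # uniform marginals (assumed)
    c = cost.new_full((m,), 1.0 / m)
    T = sinkhorn(cost, r, c)
    return 1.0 - (T * cost).sum()                   # 1 minus the EMD distance

def alignment_loss(trainable_logits, frozen_logits, temp=1.0):
    """Consistency term keeping the prompted model's predictions close to
    the frozen pre-trained model's predictions (KL divergence)."""
    p_frozen = F.softmax(frozen_logits / temp, dim=-1)
    log_p_train = F.log_softmax(trainable_logits / temp, dim=-1)
    return F.kl_div(log_p_train, p_frozen, reduction="batchmean")

if __name__ == "__main__":
    q = F.normalize(torch.randn(5, 512), dim=-1)    # 5 query local features
    p = F.normalize(torch.randn(3, 512), dim=-1)    # 3 cluster prototypes
    print("EMD similarity:", float(emd_similarity(q, p)))
    logits_t, logits_f = torch.randn(4, 10), torch.randn(4, 10)
    print("alignment loss:", float(alignment_loss(logits_t, logits_f)))
```

Under these assumptions, the EMD term compares whole sets of features instead of single pooled vectors, and the alignment term regularizes the trainable prompts so the adapted model does not drift from the frozen model's zero-shot behavior.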
