Abstract

Prompt learning has recently emerged as a promising method for fine-tuning vision-language models. By introducing prompts into the text encoder or the image encoder, a pre-trained model can quickly adapt to downstream tasks without updating its pre-trained weights. However, prior multi-modal prompt tuning methods ignore the difference in feature distributions between text and images and adopt the same prompts for both encoders, which leads to sub-optimal performance in downstream few-shot learning. In this paper, we propose Modal-Aware Prompt (MAP) to alleviate this issue. Specifically, considering the stability of text features, we design text-specific prompts that acquire class-related text information from a general template (i.e., “a photo of a <category>”) through unidirectional attention-based interaction. Considering the diversity of image features, we design visual-specific prompts that acquire class-related image information and adjust the image features through bidirectional attention-based interaction. To learn hierarchical prompt representations and reinforce the prompt features, we further propose a Deep Adaptive Feature Enhancement (DAFE) module that adaptively reuses the prompt output of the preceding layer, combining instance-level and task-level information. Combining these two designs, our method, MAP-DAFE, obtains state-of-the-art results on 11 image recognition datasets and converges faster than the compared methods. These results indicate that MAP-DAFE is both effective and efficient.
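
The contrast between unidirectional and bidirectional interaction can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention layout, so the sketch assumes that "unidirectional" means the prompts read from the template tokens while the template stays unchanged, and that "bidirectional" means prompts and image tokens update each other. The function names (`attend`, `unidirectional`, `bidirectional`, `mix_with_previous`), the residual updates, the absence of learned projections, and the scalar `gate` standing in for DAFE's adaptive reuse of the preceding layer's prompts are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with no learned projections (assumption).

    Each query becomes a weighted average of the key/value vectors.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                        for j in range(dim)])
    return outputs


def add(xs, ys):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def unidirectional(prompts, template_tokens):
    """Assumed text-side interaction: prompts read from the template,
    and the template tokens are returned unchanged."""
    new_prompts = add(prompts, attend(prompts, template_tokens))
    return new_prompts, template_tokens


def bidirectional(prompts, image_tokens):
    """Assumed visual-side interaction: prompts read from the image tokens,
    and the image tokens are adjusted by reading from the prompts."""
    new_prompts = add(prompts, attend(prompts, image_tokens))
    new_image_tokens = add(image_tokens, attend(image_tokens, prompts))
    return new_prompts, new_image_tokens


def mix_with_previous(previous_output, layer_prompts, gate):
    """Hypothetical stand-in for DAFE: blend the preceding layer's prompt
    output with this layer's own prompts using a scalar gate in [0, 1]."""
    return [[gate * p + (1.0 - gate) * l for p, l in zip(prev, layer)]
            for prev, layer in zip(previous_output, layer_prompts)]


if __name__ == "__main__":
    prompts = [[0.1, 0.2], [0.3, -0.1]]
    template = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    image = [[0.2, 0.9], [-0.4, 0.3]]

    text_prompts, template_out = unidirectional(prompts, template)
    print("template unchanged:", template_out == template)

    visual_prompts, image_out = bidirectional(prompts, image)
    print("image tokens adjusted:", image_out != image)
```

In a real model the gate would be learned, and the attention would use the encoder's own projection weights; the sketch only shows which side of each interaction gets updated.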
