Abstract

Prompt learning has recently emerged as a promising method for fine-tuning vision-language models. By introducing prompts into the text encoder or image encoder, the pre-trained model can quickly adapt to downstream tasks without updating the pre-trained weights. However, prior multi-modal prompt tuning methods do not account for the difference in feature distributions between text and images and adopt the same prompts for both encoders, leading to sub-optimal performance on downstream few-shot learning. In this paper, we propose Modal-Aware Prompt (MAP) to alleviate this issue. Specifically, considering the stability of text features, we design text-specific prompts that acquire text class-related information from a general template (i.e., “a photo of a <category>”) through unidirectional attention-based interaction. Additionally, considering the diversity of image features, we design visual-specific prompts that acquire image class-related information and adjust the image features through bidirectional attention-based interaction. To learn hierarchical prompt representations and reinforce the prompt features, we further propose a Deep Adaptive Feature Enhancement (DAFE) module that adaptively utilizes the prompt output of the previous layer, combining instance-level and task-level information simultaneously. Combining these two designs, our method MAP-DAFE achieves state-of-the-art results on 11 image recognition datasets and exhibits the fastest convergence rate among the compared methods, demonstrating that MAP-DAFE is both effective and efficient.
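
To make the attention-based interactions concrete, the following is a minimal PyTorch-style sketch (not the released implementation) of how the modal-aware prompts described above could interact with the two encoders. The class name, prompt counts, dimensions, and the use of standard multi-head attention modules are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class ModalAwarePromptSketch(nn.Module):
    """Illustrative sketch of modal-aware prompt interaction (assumed design).

    Text-specific prompts attend to a frozen template ("a photo of a <category>")
    in one direction only; visual-specific prompts and image features attend to
    each other in both directions.
    """

    def __init__(self, dim: int = 512, n_prompts: int = 4, n_heads: int = 8):
        super().__init__()
        # Learnable text-specific and visual-specific prompt tokens.
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.visual_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # Attention modules for the two interaction patterns (assumed choice).
        self.text_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.prompt_to_image = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.image_to_prompt = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, template_tokens: torch.Tensor, image_tokens: torch.Tensor):
        # template_tokens: (B, L_t, dim) embeddings of the general text template
        # image_tokens:    (B, L_v, dim) patch embeddings from the image encoder
        B = template_tokens.size(0)
        tp = self.text_prompts.unsqueeze(0).expand(B, -1, -1)
        vp = self.visual_prompts.unsqueeze(0).expand(B, -1, -1)

        # Unidirectional interaction: text prompts query the template;
        # the template itself is left unchanged.
        tp, _ = self.text_attn(query=tp, key=template_tokens, value=template_tokens)

        # Bidirectional interaction: visual prompts query the image features,
        # then the image features query the updated visual prompts.
        vp, _ = self.prompt_to_image(query=vp, key=image_tokens, value=image_tokens)
        image_tokens, _ = self.image_to_prompt(query=image_tokens, key=vp, value=vp)

        return tp, vp, image_tokens
```

In this reading, the unidirectional branch only updates the text prompts from the fixed template, while the bidirectional branch lets the visual prompts and image features refine each other, reflecting the asymmetry motivated above by the stability of text features versus the diversity of image features.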
