COMMA: Co-articulated Multi-Modal Learning

Lianyu Hu, Wei Feng, Zekang Liu, Chi-Man Pun, Liqing Gao

doi:10.1609/aaai.v38i3.27997

Abstract

Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. However, they are sensitive to variations in the input text prompts and need carefully selected prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to learn the prompts dynamically as textual inputs, avoiding laborious hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, most previous methods achieve better performance on seen classes but degrade on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pretraining stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. Specifically, our method generates prompts by considering the prompts from both branches, which enhances the representation alignment of both branches. Besides, to alleviate forgetting of the essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts in the pre-trained CLIP in the late transformer layers. We evaluate our method on three representative tasks: generalization to novel classes, to new target datasets, and to unseen domain shifts. Experimental results demonstrate the superiority of our method, which exhibits a favorable performance boost on all tasks with high efficiency. Code is available at https://github.com/hulianyuyy/COMMA.
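
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that), and it rests on several assumptions: the function names cosine_distance, preservation_loss and co_articulate are hypothetical, the discrepancy is measured here with cosine distance because the abstract does not name a metric, and the co-articulation step is shown as a single linear projection over the concatenated prompts of both branches.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def preservation_loss(learned_feats, handcrafted_feats):
    """Mean discrepancy between learned-prompt and hand-crafted-prompt features.

    Assumption: one feature vector per class in each list, compared pairwise.
    The metric (cosine distance) is an illustrative choice, not taken from the paper.
    """
    pairs = list(zip(learned_feats, handcrafted_feats))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def co_articulate(vision_prompt, text_prompt, weight):
    """Build a new prompt vector from the prompts of both branches.

    Assumption: `weight` is a hypothetical projection matrix given as a list of
    rows, each of length len(vision_prompt) + len(text_prompt).
    """
    joint = vision_prompt + text_prompt  # concatenate the two prompt vectors
    return [sum(w * x for w, x in zip(row, joint)) for row in weight]


if __name__ == "__main__":
    learned = [[1.0, 0.0], [0.0, 1.0]]
    handcrafted = [[1.0, 0.0], [1.0, 0.0]]
    print(preservation_loss(learned, handcrafted))

    projection = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    print(co_articulate([1.0, 2.0], [3.0, 4.0], projection))
```

In a training loop, a term like preservation_loss would be added to the task loss so that the learned text features stay close to the frozen CLIP features of hand-crafted prompts, while co_articulate stands in for the step that lets each branch's prompts depend on both modalities.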

More From: Proceedings of the AAAI Conference on Artificial Intelligence
