CALIP: Zero-Shot Enhancement of CLIP with Parameter-Free Attention

Ziyu Guo, Xianzheng Ma, Longtian Qiu, Xupeng Miao, Xuming He, Bin Cui, Renrui Zhang

doi:10.1609/aaai.v37i1.25152

Abstract

Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with promising zero-shot performance. To further improve its downstream accuracy, existing works add learnable modules on top of CLIP and fine-tune them on few-shot training sets. However, the resulting extra training cost and data requirement severely hinder the efficiency of model deployment and knowledge transfer. In this paper, we introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module. Specifically, we guide visual and textual representations to interact with each other and explore cross-modal informative features via attention. As the pre-training has largely reduced the embedding distances between the two modalities, we discard all learnable parameters in the attention and bidirectionally update the multi-modal features, making the whole process parameter-free and training-free. In this way, the image features are blended with text-aware signals and the text representations become visually guided, giving better adaptive zero-shot alignment. We evaluate CALIP on benchmarks spanning 14 datasets for both 2D image and 3D point cloud few-shot classification, showing consistent zero-shot performance improvement over CLIP. Based on that, we further insert a small number of linear layers into CALIP's attention module and verify its robustness under few-shot settings, where it also achieves leading performance compared to existing methods. These extensive experiments demonstrate the superiority of our approach for efficient enhancement of CLIP. Code is available at https://github.com/ZiyuGuo99/CALIP.
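
The bidirectional, parameter-free update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes the visual input is a set of spatial region features and the textual input is one feature per class, and it uses a plain softmax over raw similarities in place of learned query/key/value projections. The function name `parameter_free_attention`, the `temperature` argument, and the toy inputs are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    """Transpose a matrix stored as nested lists."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def parameter_free_attention(visual, textual, temperature=1.0):
    """Bidirectionally update features using only their own similarities.

    visual:  (num_regions x dim) image features (assumed layout)
    textual: (num_classes x dim) class text features (assumed layout)
    There are no projection weights, so nothing here is trained.
    """
    # Cross-modal similarity: one score per (region, class) pair.
    scores = matmul(visual, transpose(textual))

    # Text-aware visual features: each region attends over the classes.
    vis_attn = [softmax([s / temperature for s in row]) for row in scores]
    visual_updated = matmul(vis_attn, textual)

    # Visually guided text features: each class attends over the regions.
    txt_attn = [softmax([s / temperature for s in row])
                for row in transpose(scores)]
    textual_updated = matmul(txt_attn, visual)

    return visual_updated, textual_updated


# Toy example: three image regions and two class prompts in a 2-D space.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
textual = [[1.0, 0.0], [0.0, 1.0]]
v_new, t_new = parameter_free_attention(visual, textual)
print(v_new)
print(t_new)
```

Both outputs keep the shape of their inputs, so in a sketch like this the updated features could be fed to the same cosine-similarity classifier that zero-shot CLIP already uses. How the updated and original features are combined into final logits is left out here, since the abstract does not specify it.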

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 26, 2023
Citations: 13

