Cross-modal adapter for vision–language retrieval

Haojun Jiang,Jianke Zhang,Rui Huang,Chunjiang Ge,Zanlin Ni,Shiji Song,Gao Huang

doi:10.1016/j.patcog.2024.111144

Abstract

Vision–language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models such as CLIP have shown great potential on retrieval tasks. However, as pre-trained models scale up, fully fine-tuning them on downstream retrieval datasets carries a high risk of overfitting. Moreover, in practice, it is costly to train and store a large model for each task. To overcome these issues, we present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adapt the pre-trained model with a few additional parameterized layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows encoder-level implicit cross-modal interactions between the vision and language encoders. Although surprisingly simple, our approach has three notable benefits: (1) it removes the vast majority of fine-tuned parameters, (2) it saves training time, and (3) it allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach outperforms adapter-based methods on image–text retrieval datasets (MSCOCO, Flickr30K) and video–text retrieval datasets (MSR-VTT, DiDeMo, and ActivityNet).
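
To make the adapter idea concrete, the following is a minimal sketch of a generic bottleneck adapter placed in both a vision and a text encoder, with the backbone left frozen. It is not the authors' implementation. The abstract does not say how the implicit cross-modal interaction is realized; sharing one up-projection matrix between the two adapters is an assumption made here purely for illustration. The names `BottleneckAdapter`, `trainable_params`, and the sizes `dim` and `bottleneck` are hypothetical.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class BottleneckAdapter:
    """Hypothetical adapter: x + up(relu(down(x))).

    Only `down` and `up` would be trained; the backbone stays frozen.
    """

    def __init__(self, dim, bottleneck, shared_up, rng):
        # Modality-specific down-projection (bottleneck x dim).
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck). ASSUMPTION: the same object is
        # handed to both modalities, so an update to it affects both encoders.
        self.up = shared_up

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]  # ReLU
        delta = matvec(self.up, hidden)
        return [a + b for a, b in zip(x, delta)]  # residual connection


def trainable_params(adapters):
    """Count adapter weights, counting a shared matrix only once."""
    seen, total = set(), 0
    for adapter in adapters:
        for matrix in (adapter.down, adapter.up):
            if id(matrix) not in seen:
                seen.add(id(matrix))
                total += sum(len(row) for row in matrix)
    return total


rng = random.Random(0)
dim, bottleneck = 8, 2  # toy sizes, chosen only for the example

# Zero-initialised shared up-projection: each adapter starts as an identity map.
shared_up = [[0.0] * bottleneck for _ in range(dim)]
vision_adapter = BottleneckAdapter(dim, bottleneck, shared_up, rng)
text_adapter = BottleneckAdapter(dim, bottleneck, shared_up, rng)

features = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
adapted = vision_adapter(features)
n_trainable = trainable_params([vision_adapter, text_adapter])
```

With the zero-initialised up-projection, each adapter initially returns its input unchanged, so the frozen pre-trained features are preserved at the start of training. The trainable parameter count grows with `dim * bottleneck` per projection rather than with the size of the backbone, which is the general reason adapter-style methods fine-tune far fewer parameters than full fine-tuning.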
