Abstract

Text-video retrieval is a critical multi-modal task that aims to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models as model sizes grow remains a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder only encodes frame-level features and fails to extract global-level video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we model video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments show that, with only 0.67% of the parameters tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuned methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL.
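
As a rough illustration of the two mechanisms the abstract names, the sketch below shows (1) generating text-side and frame-side prompts from one shared latent table and (2) pooling local frame features through a learnable global token. This is not the authors' implementation: the module names, dimensions, and PyTorch framing are all our assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the DGL source code) of shared-latent
# cross-modal prompt generation and global-local video attention.
import torch
import torch.nn as nn


class SharedPromptGenerator(nn.Module):
    """Project one shared latent prompt table into text and frame prompts."""

    def __init__(self, num_prompts: int, latent_dim: int,
                 text_dim: int, visual_dim: int):
        super().__init__()
        # One learnable latent table shared by both modalities.
        self.latent = nn.Parameter(torch.randn(num_prompts, latent_dim) * 0.02)
        self.to_text = nn.Linear(latent_dim, text_dim)
        self.to_frame = nn.Linear(latent_dim, visual_dim)

    def forward(self):
        # Both prompt sets derive from the same latents, one way to
        # encourage inter-modal interaction during prompt tuning.
        return self.to_text(self.latent), self.to_frame(self.latent)


class GlobalLocalVideoAttention(nn.Module):
    """A learnable global token attends over local frame-level features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.global_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) local frame features.
        b = frame_feats.size(0)
        q = self.global_token.expand(b, -1, -1)
        # The global query summarizes all frames into one video-level feature.
        video_feat, _ = self.attn(q, frame_feats, frame_feats)
        return video_feat.squeeze(1)  # (batch, dim)


if __name__ == "__main__":
    gen = SharedPromptGenerator(num_prompts=4, latent_dim=256,
                                text_dim=512, visual_dim=768)
    text_prompts, frame_prompts = gen()
    print(text_prompts.shape, frame_prompts.shape)  # (4, 512) (4, 768)

    pool = GlobalLocalVideoAttention(dim=768)
    frames = torch.randn(2, 12, 768)                # 2 clips, 12 frames each
    print(pool(frames).shape)                       # (2, 768)
```

In a setup like this, only the prompt latents, the two projections, and the global-attention module would be trained while the CLIP encoders stay frozen, which is consistent with the small tuned-parameter budget the abstract reports.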
