Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Kaibin Tian, Han Li, Xinglin Hou, Yi Liu, Quan Chen, Yanhua Cheng

doi:10.1609/aaai.v38i6.28327

Abstract

In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, the methods with the most impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, despite its prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, so that during the training phase the model captures visual content features spanning from abstract to detailed levels. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This solution balances the coarse and fine granularity of retrieval content, and it also strikes a balance between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of the top-k candidates, which are then reranked using fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, our method achieves performance comparable to current state-of-the-art methods while being nearly 50 times faster.
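The two-stage retrieval described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how TIB or the coarse representation is computed, so the softmax-weighted frame pooling in `text_gated_pool`, the mean pooling in `mean_pool`, the temperature value, and all function names and toy vectors below are assumptions chosen only to show the recall-then-rerank flow.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def mean_pool(frame_vecs):
    # Assumed coarse-grained video feature: text-agnostic average of frames.
    dim = len(frame_vecs[0])
    mean = [sum(f[d] for f in frame_vecs) / len(frame_vecs) for d in range(dim)]
    return l2_normalize(mean)

def text_gated_pool(text_vec, frame_vecs, temperature=0.1):
    # Hypothetical stand-in for a parameter-free text-gated block:
    # frames are weighted by a softmax over their similarity to the text.
    sims = [dot(text_vec, f) / temperature for f in frame_vecs]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    dim = len(frame_vecs[0])
    pooled = [sum((e / total) * f[d] for e, f in zip(exps, frame_vecs))
              for d in range(dim)]
    return l2_normalize(pooled)

def two_stage_retrieve(text_vec, coarse_vecs, frame_feats, k):
    # Stage 1: cheap recall of the top-k videos using coarse features.
    coarse = [dot(text_vec, v) for v in coarse_vecs]
    candidates = sorted(range(len(coarse_vecs)),
                        key=lambda i: coarse[i], reverse=True)[:k]
    # Stage 2: rerank only those k candidates with text-conditioned features.
    fine = {i: dot(text_vec, text_gated_pool(text_vec, frame_feats[i]))
            for i in candidates}
    return sorted(candidates, key=lambda i: fine[i], reverse=True)

# Toy data: three videos with two 2-D frame features each (made up).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],   # video 0: one frame matches the query exactly
    [[0.8, 0.6], [0.8, 0.6]],   # video 1: every frame is a partial match
    [[0.0, 1.0], [0.0, 1.0]],   # video 2: no frame matches
]
coarse_index = [mean_pool(f) for f in frames]  # computed offline, once per video
query = [1.0, 0.0]

ranked = two_stage_retrieve(query, coarse_index, frames, k=2)
print(ranked)
```

In this toy setup the coarse stage ranks video 1 above video 0, and the fine stage reverses that order because the text-weighted pooling focuses on the one matching frame of video 0. The expensive text-conditioned step runs on only k candidates rather than the whole collection, which is the source of the efficiency gain the abstract describes.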

Published in: Proceedings of the AAAI Conference on Artificial Intelligence
