Abstract

Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and text descriptions can vary significantly. Therefore, such methods fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to tackle this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to address temporal and semantic diversity, consequently constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning that enables each video path to acquire meaningful, distinct competencies in representing varied semantics. Our video embedding approach is text-agnostic, allowing the prepared video embeddings to be reused for any new text query. Experiments show that our method outperforms existing methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses of the diverse video embeddings.
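
To make the one-to-many retrieval idea concrete, the following is a minimal, hypothetical sketch, not the authors' implementation. It assumes each video yields several text-agnostic embedding "paths" and that a text query scores a video by its best-matching path; the random-weight pooling here is only a placeholder standing in for the paper's diverse temporal aggregation and multi-key memory.

```python
# Minimal sketch (assumed, not the authors' code) of one-to-many
# text-video retrieval via multiple text-agnostic video embeddings.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize vectors so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def video_paths(frame_features, num_paths=4, rng=None):
    """Placeholder for diverse temporal aggregation + multi-key memory:
    produce several embeddings per video by randomly weighted temporal
    pooling over frame features of shape (T, D)."""
    rng = np.random.default_rng(0) if rng is None else rng
    T, _ = frame_features.shape
    weights = rng.random((num_paths, T))
    weights /= weights.sum(axis=1, keepdims=True)   # (P, T) pooling weights
    return l2_normalize(weights @ frame_features)   # (P, D) path embeddings

def retrieve(text_emb, video_embs):
    """One-to-many matching: score each video by its best path,
    sim(t, v) = max_p cos(t, v_p), and rank videos by that score."""
    t = l2_normalize(text_emb)
    scores = [float((paths @ t).max()) for paths in video_embs]
    return np.argsort(scores)[::-1], scores

# Toy usage: 3 videos, each with 8 frames of 16-d features.
rng = np.random.default_rng(42)
gallery = [video_paths(rng.standard_normal((8, 16))) for _ in range(3)]
query = rng.standard_normal(16)
ranking, scores = retrieve(query, gallery)
print("ranked video indices:", ranking)
```

Because the path embeddings are computed without seeing any text, the gallery above can be built once and reused for every new query, which is the text-agnostic property the abstract highlights.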
