Abstract

In general, videos are powerful at recording physical patterns (e.g., spatial layout), while texts excel at describing abstract symbols (e.g., emotion). In multi-modal tasks, video and text are regarded as complementary, and their distinct information is crucial. However, in cross-modal tasks (e.g., retrieval), existing works usually exploit only their common part through common space learning, while their distinct information is discarded. In this paper, we argue that distinct information is also beneficial for cross-modal retrieval. To address this problem, we propose a divide-and-conquer learning approach, namely Complementarity-aware Space Learning (CSL), which recasts the challenge as learning two spaces (i.e., a latent space and a symbolic space) to simultaneously explore the common and distinct information of the two modalities, taking their complementary character into account. Specifically, we first learn a symbolic space from video with a memory-based video encoder and a symbolic generator. In contrast, we learn a latent space from text with a text encoder and a memory-based latent feature selector. Finally, we propose a complementarity-aware loss that integrates the two spaces to facilitate video-text retrieval. Extensive experiments show that our approach outperforms existing state-of-the-art methods by 5.1%, 2.1%, and 0.9% in R@10 for text-to-video retrieval on three benchmarks, respectively. An ablation study further verifies that the distinct information from video and text improves retrieval performance. Trained models and source code have been released at https://github.com/NovaMind-Z/CSL.
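
To make the idea of combining the two spaces concrete, the following is a minimal sketch (not the authors' released implementation; see the repository above for that) of how similarities computed in a shared latent space and a symbolic space could be fused into a complementarity-aware ranking loss. All names and hyperparameters here (alpha, margin, the embedding sizes) are illustrative assumptions.

import torch
import torch.nn.functional as F

def complementarity_aware_loss(video_latent, text_latent,
                               video_symbols, text_symbols,
                               margin=0.2, alpha=0.5):
    """Triplet-style ranking loss over a fused similarity matrix.

    video_latent, text_latent   : (B, D) embeddings in the shared latent space
    video_symbols, text_symbols : (B, V) symbol distributions (e.g., predicted
                                  concept probabilities) in the symbolic space
    alpha                       : assumed weight balancing the two spaces
    """
    # Cosine similarity in the latent space
    sim_latent = F.normalize(video_latent, dim=-1) @ F.normalize(text_latent, dim=-1).T
    # Cosine similarity over symbol distributions in the symbolic space
    sim_symbol = F.normalize(video_symbols, dim=-1) @ F.normalize(text_symbols, dim=-1).T
    # Fused, complementarity-aware similarity
    sim = alpha * sim_latent + (1.0 - alpha) * sim_symbol

    B = sim.size(0)
    pos = sim.diag().view(B, 1)                     # matched video-text pairs
    cost_t2v = (margin + sim - pos).clamp(min=0)    # text-to-video ranking cost
    cost_v2t = (margin + sim - pos.T).clamp(min=0)  # video-to-text ranking cost
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    return (cost_t2v.masked_fill(mask, 0).mean()
            + cost_v2t.masked_fill(mask, 0).mean())

# Example usage with random tensors
v_lat, t_lat = torch.randn(8, 512), torch.randn(8, 512)
v_sym, t_sym = torch.rand(8, 1000), torch.rand(8, 1000)
loss = complementarity_aware_loss(v_lat, t_lat, v_sym, t_sym)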
