Abstract

Video moment retrieval with a text query aims to retrieve the most relevant segment from a whole video based on the given text query. It is a challenging cross-modal alignment task due to the large gap between the visual and linguistic modalities and the noise introduced by manual labeling of time segments. Most existing works use language information only in the cross-modal fusion stage, neglecting the important role that language information plays in the retrieval stage. Besides, these works roughly compress the visual information in the video clips to reduce the computation cost, which loses subtle video information in long videos. In this paper, we propose a novel model termed Cross-modal Dynamic Networks (CDN), which dynamically generates convolution kernels from visual and language features. In the feature extraction stage, we also propose a frame selection module to capture the subtle video information in each video segment. With this approach, CDN reduces the impact of visual noise without significantly increasing the computation cost, leading to better video moment retrieval results. Experiments on two challenging datasets, i.e., Charades-STA and TACoS, show that our proposed CDN method outperforms a range of state-of-the-art methods by retrieving moment video clips more accurately. The implementation code and detailed instructions for our proposed CDN method are provided at https://github.com/CFM-MSG/Code_CDN.
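To make the core idea concrete, below is a minimal PyTorch sketch of language-conditioned dynamic convolution, i.e., generating a per-sample convolution kernel from a pooled sentence embedding and applying it to clip-level video features. This is an illustrative assumption of the general technique, not the authors' implementation; all names (e.g., `LanguageConditionedConv`, `kernel_gen`) and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionedConv(nn.Module):
    """Illustrative dynamic 1D convolution: the depthwise kernel is
    generated from a sentence-level language feature instead of being
    a fixed learned parameter (hypothetical sketch, not CDN's code)."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim = dim
        self.kernel_size = kernel_size
        # Maps the language feature to a depthwise kernel of shape
        # (dim, 1, kernel_size), one kernel per channel.
        self.kernel_gen = nn.Linear(dim, dim * kernel_size)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (batch, dim, num_clips); query: (batch, dim)
        b, d, t = video.shape
        kernels = self.kernel_gen(query).view(b * d, 1, self.kernel_size)
        # Grouped convolution trick: fold the batch into the channel
        # dimension so each sample is convolved with its own kernel.
        out = F.conv1d(
            video.reshape(1, b * d, t),
            kernels,
            padding=self.kernel_size // 2,
            groups=b * d,
        )
        return out.view(b, d, t)

# Usage: fuse a text query with clip-level video features.
video_feats = torch.randn(2, 256, 64)   # 2 videos, 64 clips, 256-d features
query_feats = torch.randn(2, 256)       # pooled sentence embeddings
fused = LanguageConditionedConv(256)(video_feats, query_feats)
print(fused.shape)                      # torch.Size([2, 256, 64])
```

The grouped-convolution trick keeps the per-sample dynamic kernels cheap: instead of looping over the batch, the batch is folded into the channel dimension and a single depthwise convolution applies every sample's generated kernel in one call.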
