Abstract

Video moment retrieval with a text query aims to retrieve the most relevant segment from a whole video based on the given text query. It is a challenging cross-modal alignment task due to the large gap between the visual and linguistic modalities and the noise introduced by manual labeling of time segments. Most existing works use language information only in the cross-modal fusion stage, neglecting the important role that language information plays in the retrieval stage. Besides, these works roughly compress the visual information in the video clips to reduce the computation cost, which loses subtle video information in long videos. In this paper, we propose a novel model termed Cross-modal Dynamic Networks (CDN), which dynamically generates convolution kernels from visual and language features. In the feature extraction stage, we also propose a frame selection module to capture the subtle video information in each video segment. With this approach, CDN reduces the impact of visual noise without significantly increasing the computation cost, leading to better video moment retrieval results. Experiments on two challenging datasets, i.e., Charades-STA and TACoS, show that our proposed CDN method outperforms a range of state-of-the-art methods by retrieving moment video clips more accurately. The implementation code and detailed instructions for our proposed CDN method are provided at https://github.com/CFM-MSG/Code_CDN.
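To make the core idea concrete, below is a minimal PyTorch sketch of language-conditioned dynamic convolution, i.e., generating a per-sample convolution kernel from a pooled sentence embedding and applying it to clip-level video features. This is an illustrative assumption of the general technique, not the authors' implementation; all names (e.g., `LanguageConditionedConv`, `kernel_gen`) and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionedConv(nn.Module):
    """Illustrative dynamic 1D convolution: the depthwise kernel is
    generated from a sentence-level language feature instead of being
    a fixed learned parameter (hypothetical sketch, not CDN's code)."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim = dim
        self.kernel_size = kernel_size
        # Maps the language feature to a depthwise kernel of shape
        # (dim, 1, kernel_size), one kernel per channel.
        self.kernel_gen = nn.Linear(dim, dim * kernel_size)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (batch, dim, num_clips); query: (batch, dim)
        b, d, t = video.shape
        kernels = self.kernel_gen(query).view(b * d, 1, self.kernel_size)
        # Grouped convolution trick: fold the batch into the channel
        # dimension so each sample is convolved with its own kernel.
        out = F.conv1d(
            video.reshape(1, b * d, t),
            kernels,
            padding=self.kernel_size // 2,
            groups=b * d,
        )
        return out.view(b, d, t)

# Usage: fuse a text query with clip-level video features.
video_feats = torch.randn(2, 256, 64)   # 2 videos, 64 clips, 256-d features
query_feats = torch.randn(2, 256)       # pooled sentence embeddings
fused = LanguageConditionedConv(256)(video_feats, query_feats)
print(fused.shape)                      # torch.Size([2, 256, 64])
```

The grouped-convolution trick keeps the per-sample dynamic kernels cheap: instead of looping over the batch, the batch is folded into the channel dimension and a single depthwise convolution applies every sample's generated kernel in one call.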
