Interaction-Integrated Network for Natural Language Moment Localization.

Ke Ning, Qi Tian, Fei Wu, Jianzhuang Liu, Lingxi Xie

doi:10.1109/tip.2021.3052086

Abstract

Natural language moment localization aims to localize video clips according to a natural language description. The key to this challenging task lies in modeling the relationship between verbal descriptions and visual content. Existing approaches often sample a number of clips from the video and individually determine how each of them is related to the query sentence. However, this strategy can fail dramatically, particularly when the query sentence refers to visual elements that appear outside of, or even far from, the target clip. In this paper, we address this issue by designing an Interaction-Integrated Network (I2N), which contains a few Interaction-Integrated Cells (I2Cs). The idea rests on the observation that the query sentence not only provides a description of the video clip, but also contains semantic cues about the structure of the entire video. Based on this, I2Cs go one step beyond modeling short-term contexts in the time domain by encoding long-term video content into every frame feature. By stacking a few I2Cs, the resulting network, I2N, gains an improved inference ability, brought by both (I) multi-level correspondence between vision and language and (II) more accurate cross-modal alignment. When evaluated on a challenging video moment localization dataset named DiDeMo, I2N outperforms the state-of-the-art approach by a clear margin of 1.98%. On two other challenging datasets, Charades-STA and TACoS, I2N also reports competitive performance.
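
The idea of encoding long-term video content into every frame feature can be illustrated with a minimal sketch. This is not the authors' I2C design, which the abstract does not specify; it is a generic query-guided attention pooling, assumed here only for illustration. The function names (`softmax`, `dot`, `integrate_long_term_context`) and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def integrate_long_term_context(frames, query):
    """Add a query-weighted summary of the whole video to every frame.

    Assumption: frames and query share one feature dimension. Each frame
    is weighted by its similarity to the query, the weighted frames are
    summed into one context vector, and that vector is added to every
    frame, so even frames far from the target clip contribute.
    """
    weights = softmax([dot(frame, query) for frame in frames])
    dim = len(query)
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return [[frame[d] + context[d] for d in range(dim)] for frame in frames]


# Toy example: three 2-d frame features and one 2-d query feature.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
enriched = integrate_long_term_context(frames, query)
```

After this step, each entry of `enriched` carries information from the whole video rather than only from its own time step, which is the property the abstract attributes to I2Cs. A real model would learn the similarity and fusion functions instead of using a fixed dot product and sum.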

Journal: IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society
Publication Date: Jan 1, 2021
Citations: 69

Similar Papers

Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding
Daizong Liu ... Wei Hu
IEEE Transactions on Multimedia | VOL. 25
01 Jan 2023

VMLH: Efficient Video Moment Location via Hashing
Zhifang Tan ... Xinfang Liu
Electronics | VOL. 12
13 Jan 2023

Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.
Zongmeng Zhang ... Yan Yan
IEEE Transactions on Image Processing | VOL. 30
01 Jan 2020

STCM-Net: A symmetrical one-stage network for temporal language localization in videos
Zixi Jia ... Chunbo Li
Neurocomputing | VOL. 471
16 Nov 2021
