Context-aware network with foreground recalibration for grounding natural language in video

Cheng Chen, Xiaodong Gu

doi:10.1007/s00521-021-05807-z

Abstract

Grounding natural language in video aims at retrieving the moment in a long, untrimmed video that matches a referring natural language query. The task is challenging because of the dominant influence of noisy background in untrimmed video and the complex temporal relationships introduced by the query. Existing methods treat candidate segments separately in a matching-and-aligning manner and thus neglect that different target segments require different levels of context information. In this paper, we present the semantic modulation residual module, a novel single-shot feed-forward residual network that explicitly integrates features at various temporal scales and, guided by the semantic information of the query, introduces less noise into the final moment representation. To establish more fine-grained interactions between different moments, a global interaction module is embedded in the network. Moreover, the data imbalance caused by sparsely annotated moments weakens the effect of the binary cross-entropy criterion. Therefore, we design a foreground recalibration mechanism to enhance intra-class consistency and highlight the positive moments. We evaluate our method on three benchmark datasets, i.e., TACoS, Charades-STA and ActivityNet Captions, achieving state-of-the-art performance without any post-processing. In particular, we reach 32.17%, 45.11% and 43.76% under the metric Rank@1, IoU@0.5 on TACoS, Charades-STA and ActivityNet Captions, respectively. Furthermore, ablation studies show the effectiveness of the individual components of the proposed method. We hope that the proposed method can serve as a strong and simple alternative for fine-grained video retrieval.
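For readers unfamiliar with the reported metric, the following is a minimal sketch of how "Rank@1, IoU@0.5" is commonly computed in moment-localization work: the top-ranked predicted segment for each query counts as a hit if its temporal intersection-over-union with the annotated segment is at least 0.5. The function names, the `(start, end)` tuple layout, and the example values are assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import List, Tuple

# A moment is assumed to be a (start, end) pair in seconds.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def rank1_at_iou(top1_preds: List[Segment],
                 gts: List[Segment],
                 threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Hypothetical example: two queries, one top-1 prediction each.
preds = [(0.0, 10.0), (2.0, 8.0)]
truth = [(0.0, 8.0), (20.0, 30.0)]
score = rank1_at_iou(preds, truth)
```

In this made-up example the first prediction overlaps its annotation with IoU 0.8 and the second does not overlap at all, so one of the two queries counts as a hit.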


Journal: Neural Computing and Applications
Publication Date: Feb 26, 2021
Citations: 2

Similar Papers

Multi-Scale 2D Temporal Adjacency Networks for Moment Localization With Natural Language
Songyang Zhang ... Houwen Peng
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 44
01 Dec 2022

Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language
Songyang Zhang ... Jiebo Luo
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 34
03 Apr 2020

Language-Guided Multi-Granularity Context Aggregation for Temporal Sentence Grounding
Guoqiang Gong ... Linchao Zhu
IEEE Transactions on Multimedia | VOL. 25
01 Jan 2023

Multi-modal Dense Video Captioning
Vladimir Iashin ... Esa Rahtu
01 Jun 2020
