Abstract

Temporal sentence grounding in videos is a crucial task in vision-language learning. Its goal is to retrieve a video segment from an untrimmed video that semantically corresponds to a natural language query. A video usually contains multiple semantic events, which are rarely isolated: they tend to be temporally ordered and semantically correlated (e.g., one event is often the precursor of another). To precisely localize a semantic moment in a video, it is critical to effectively extract and aggregate multi-granularity contextual information, including the fine-grained local context around the moment-related video segment (at the snippet level) and the coarse-grained semantic correlations (at the segment level). A second key insight of this work is that this context aggregation should be guided by the query rather than being fully query-agnostic. Putting these ideas together, we present a new network that performs language-guided multi-granularity context aggregation. It comprises two major modules. The core of the first module is a novel language-guided temporal adaptive convolution (LTAC) devised to extract fine-grained information over video snippets around the ground-truth video segment. It decomposes a convolution into a channel-oriented one and a temporal-oriented one. In particular, since the convolutional channels are more susceptible to the query, we learn to generate a dynamic channel-oriented kernel conditioned on the querying sentence. As the second module, we propose a language-guided global relation block (LGRB) that extracts video-level context. It augments the contextual features with a multi-scale temporal attention that tackles the scale variation of ground-truth video segments, and a multi-modal semantic attention that relies on the syntax of the query. For validation, we conduct comprehensive experiments on two widely adopted video benchmarks (i.e., ActivityNet Captions and Charades-STA). Experimental results and ablation studies clearly corroborate the effectiveness of our model design, which outstrips prior state-of-the-art methods on the task's major performance metrics.
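To make the LTAC decomposition concrete, below is a minimal PyTorch sketch of the idea described above: a query-conditioned channel-oriented (1x1) convolution whose kernel is generated from the sentence embedding, followed by a static temporal (depthwise) convolution over snippets. All names (LTACSketch, video_dim, query_dim, kernel_gen) and the specific kernel-generation scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a language-guided temporal adaptive convolution (LTAC-style)
# block, assuming PyTorch. Dimensions and module names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LTACSketch(nn.Module):
    """Decomposes a temporal convolution into (i) a channel-oriented 1x1
    convolution whose kernel is generated dynamically from the query
    embedding, and (ii) a static temporal (depthwise) convolution."""

    def __init__(self, video_dim=256, query_dim=256, kernel_size=3):
        super().__init__()
        # Generates a per-sample channel-mixing kernel from the sentence feature.
        self.kernel_gen = nn.Linear(query_dim, video_dim * video_dim)
        # Query-agnostic temporal convolution applied per channel (depthwise).
        self.temporal_conv = nn.Conv1d(
            video_dim, video_dim, kernel_size,
            padding=kernel_size // 2, groups=video_dim)

    def forward(self, video_feat, query_feat):
        # video_feat: (B, C, T) snippet-level video features
        # query_feat: (B, D) sentence-level query embedding
        B, C, T = video_feat.shape
        # 1) Channel-oriented convolution with a query-conditioned kernel.
        w = self.kernel_gen(query_feat).view(B, C, C)   # (B, C_out, C_in)
        x = torch.bmm(w, video_feat)                    # (B, C, T)
        # 2) Temporal-oriented convolution over neighboring snippets.
        return self.temporal_conv(F.relu(x))            # (B, C, T)


# Usage: two videos of 128 snippets with 256-d features and 256-d queries.
if __name__ == "__main__":
    ltac = LTACSketch()
    v = torch.randn(2, 256, 128)
    q = torch.randn(2, 256)
    print(ltac(v, q).shape)  # torch.Size([2, 256, 128])
```

The design choice this sketch illustrates is that only the channel-mixing kernel is made dynamic with respect to the query, while the temporal kernel stays static, matching the abstract's claim that channels are the more query-susceptible component.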
