Abstract

Temporal sentence grounding aims to localize the specific segment of a video described by a query sentence. Previous methods follow the common equally-spaced frame selection mechanism for appearance and motion modeling, which neither filters out redundant and distracting visual information nor guarantees that all meaningful frames are obtained. Moreover, this task needs to detect location clues precisely in both the spatial and temporal dimensions, but the relationship between spatial-temporal semantic information and the query sentence remains unexplored in existing methods. Inspired by human thinking patterns, we propose a Coarse-to-Fine Spatial-Temporal Relationship Inference (CFSTRI) network to progressively localize fine-grained activity segments. First, we present a coarse-grained crucial frame selection module, where query-guided local difference context modeling over adjacent frames helps discriminate all the coarse boundary locations relevant to the sentence semantics, and the soft-assignment vector of locally aggregated descriptors is employed to enhance the representation of the selected frames. Then, we develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries, which disentangles the spatial and temporal semantic information from the query sentence to guide the excavation of visual grounding clues in the corresponding dimensions. Furthermore, we devise a gated graph convolution network to incorporate the spatial-temporal semantic information, leveraging a gate operation to highlight the frames referred to by the query sentence in the spatial and temporal dimensions and propagating the fused information over the graph. Extensive experiments on two benchmark datasets demonstrate that our CFSTRI significantly outperforms most state-of-the-art methods.
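To make the last mechanism more concrete, a minimal PyTorch sketch of a gated graph convolution layer over frame nodes is given below; the class name, layer names, tensor shapes, and gating form are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConvLayer(nn.Module):
    """Illustrative gated graph convolution over frame nodes (hypothetical design).

    A gate computed from the disentangled spatial/temporal query semantics decides
    how strongly each frame node contributes before information is propagated
    along the frame graph.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # gate from [frame ; query-semantic] pairs
        self.conv = nn.Linear(dim, dim)       # feature transform after propagation

    def forward(self, nodes: torch.Tensor, semantics: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes:     (B, T, D) frame features
        # semantics: (B, D)    spatial or temporal cue distilled from the query
        # adj:       (B, T, T) row-normalized adjacency between frames
        sem = semantics.unsqueeze(1).expand_as(nodes)                   # (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([nodes, sem], dim=-1)))   # per-node, per-channel gate
        gated = g * nodes                                               # highlight query-referred frames
        return F.relu(self.conv(torch.bmm(adj, gated))) + nodes         # propagate fused information, keep a residual path
```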

Highlights

  • Grounding temporal activities in videos [1]–[6] is a fundamental and challenging task in multimedia information retrieval

  • We devise a coarse-grained crucial frame selection module, where query-guided local difference context modeling over adjacent frames provides significant cues to identify all the coarse boundary locations related to the query sentence and suppress the distracting ones

  • We present a coarse-grained crucial frame selection module, which leverages query-guided local difference attention pooling to roughly distinguish sentence-relevant boundary locations from irrelevant ones and applies the soft-assignment vector of locally aggregated descriptors (SA-VLAD) encoding to enhance the representation of the selected frames (a sketch of this kind of encoding follows the list)
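As a rough illustration of the SA-VLAD encoding named above, the sketch below shows a NetVLAD-style soft-assignment aggregation in PyTorch; the cluster count, layer names, and normalization choices are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAssignVLAD(nn.Module):
    """NetVLAD-style soft-assignment encoding of selected frame features (illustrative)."""

    def __init__(self, dim: int, num_clusters: int = 8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) features of the selected crucial frames
        a = F.softmax(self.assign(frames), dim=-1)          # (B, T, K) soft assignments
        residual = frames.unsqueeze(2) - self.centroids     # (B, T, K, D) residuals to each centroid
        vlad = (a.unsqueeze(-1) * residual).sum(dim=1)      # (B, K, D) weighted aggregation over frames
        vlad = F.normalize(vlad, dim=-1)                    # intra-cluster normalization
        return F.normalize(vlad.flatten(1), dim=-1)         # (B, K*D) enhanced frame-set descriptor
```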


Summary

INTRODUCTION

Grounding temporal activities in videos [1]–[6] is a fundamental and challenging task in multimedia information retrieval. We devise a coarse-grained crucial frame selection module, where query-guided local difference context modeling over adjacent frames provides significant cues to identify all the coarse boundary locations related to the query sentence and suppress the distracting ones. This basic idea [20] originates from the fact that a video is temporally continuous and highly correlated: adjacent frames at the boundary positions of segments possess diverse visual appearances, whereas adjacent frames inside each segment tend to look similar. We develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries, which disentangles the spatial and temporal semantics from the query sentence to guide the excavation of visual grounding clues in the corresponding dimensions.
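Following that intuition, a minimal sketch of query-guided local difference scoring is shown below; the gating form and layer names are hypothetical and only meant to illustrate how adjacent-frame differences could be conditioned on the sentence embedding to propose coarse boundaries.

```python
import torch
import torch.nn as nn

class LocalDifferenceBoundaryScorer(nn.Module):
    """Illustrative query-guided local difference context modeling (hypothetical design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.diff_proj = nn.Linear(dim, dim)   # transform adjacent-frame differences
        self.query_gate = nn.Linear(dim, dim)  # query-conditioned channel gate
        self.scorer = nn.Linear(dim, 1)        # boundary score per adjacent pair

    def forward(self, frames: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame features; query: (B, D) sentence embedding
        diff = frames[:, 1:] - frames[:, :-1]                      # (B, T-1, D) local differences
        gate = torch.sigmoid(self.query_gate(query)).unsqueeze(1)  # (B, 1, D) query-relevant channels
        ctx = torch.relu(self.diff_proj(diff)) * gate              # suppress query-irrelevant changes
        scores = self.scorer(ctx).squeeze(-1)                      # (B, T-1) raw boundary scores
        return torch.sigmoid(scores)                               # probability of a boundary between frames t and t+1
```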

TEMPORAL SENTENCE GROUNDING
GRAPH-BASED REASONING
PROBLEM FORMULATION
SPARSE SYNTACTIC GRAPH CONSTRUCTION
COARSE-GRAINED CRUCIAL FRAME SELECTION
FINE-GRAINED SPATIAL-TEMPORAL RELATIONSHIP MATCHING MODULE
LOSS FUNCTION FOR MODEL TRAINING
EVALUATION METRIC
PERFORMANCE COMPARISONS
