Abstract

Temporal sentence grounding aims to localize the specific segment of a video described by a query sentence. Previous methods follow the common equally-spaced frame selection mechanism for appearance and motion modeling, which neither filters out redundant and distracting visual information nor guarantees that all meaningful frames are obtained. Moreover, this task needs to detect location clues precisely in both the spatial and temporal dimensions, but the relationship between spatial-temporal semantic information and the query sentence remains unexplored in existing methods. Inspired by human thinking patterns, we propose a Coarse-to-Fine Spatial-Temporal Relationship Inference (CFSTRI) network to progressively localize fine-grained activity segments. First, we present a coarse-grained crucial frame selection module, where query-guided local difference context modeling over adjacent frames helps discriminate all the coarse boundary locations relevant to the sentence semantics, and the soft-assignment vector of locally aggregated descriptors is employed to enhance the representation of the selected frames. Then, we develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries, which disentangles the spatial and temporal semantic information from the query sentence to guide the excavation of visual grounding clues in the corresponding dimensions. Furthermore, we devise a gated graph convolution network to incorporate the spatial-temporal semantic information, leveraging a gate operation to highlight the frames referred to by the query sentence in the spatial and temporal dimensions and propagating the fused information over the graph. Extensive experiments on two benchmark datasets demonstrate that our CFSTRI significantly outperforms most state-of-the-art methods.
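To make the last mechanism more concrete, a minimal PyTorch sketch of a gated graph convolution layer over frame nodes is given below; the class name, layer names, tensor shapes, and gating form are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConvLayer(nn.Module):
    """Illustrative gated graph convolution over frame nodes (hypothetical design).

    A gate computed from the disentangled spatial/temporal query semantics decides
    how strongly each frame node contributes before information is propagated
    along the frame graph.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # gate from [frame ; query-semantic] pairs
        self.conv = nn.Linear(dim, dim)       # feature transform after propagation

    def forward(self, nodes: torch.Tensor, semantics: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes:     (B, T, D) frame features
        # semantics: (B, D)    spatial or temporal cue distilled from the query
        # adj:       (B, T, T) row-normalized adjacency between frames
        sem = semantics.unsqueeze(1).expand_as(nodes)                   # (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([nodes, sem], dim=-1)))   # per-node, per-channel gate
        gated = g * nodes                                               # highlight query-referred frames
        return F.relu(self.conv(torch.bmm(adj, gated))) + nodes         # propagate fused information, keep a residual path
```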

Highlights

  • Grounding temporal activities in videos [1]–[6] is a fundamental and challenging task in multimedia information retrieval

  • We devise a coarse-grained crucial frame selection module, where query-guided local difference context modeling over adjacent frames provides significant cues to identify all the coarse boundary locations related to the query sentence and suppress the distracting ones

  • We present a coarse-grained crucial frame selection module, which leverages query-guided local difference attention pooling to roughly distinguish sentence-relevant boundary locations from irrelevant ones and applies the soft-assignment vector of locally aggregated descriptors (SA-VLAD) encoding to enhance the representation of the selected frames (a sketch of this kind of encoding follows the list)
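As a rough illustration of the SA-VLAD encoding named above, the sketch below shows a NetVLAD-style soft-assignment aggregation in PyTorch; the cluster count, layer names, and normalization choices are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAssignVLAD(nn.Module):
    """NetVLAD-style soft-assignment encoding of selected frame features (illustrative)."""

    def __init__(self, dim: int, num_clusters: int = 8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) features of the selected crucial frames
        a = F.softmax(self.assign(frames), dim=-1)          # (B, T, K) soft assignments
        residual = frames.unsqueeze(2) - self.centroids     # (B, T, K, D) residuals to each centroid
        vlad = (a.unsqueeze(-1) * residual).sum(dim=1)      # (B, K, D) weighted aggregation over frames
        vlad = F.normalize(vlad, dim=-1)                    # intra-cluster normalization
        return F.normalize(vlad.flatten(1), dim=-1)         # (B, K*D) enhanced frame-set descriptor
```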


Summary

INTRODUCTION

Grounding temporal activities in videos [1]–[6] is a fundamental and challenging task in multimedia information retrieval. We devise a coarse-grained crucial frame selection module, where query-guided local difference context modeling over adjacent frames provides significant cues to identify all the coarse boundary locations related to the query sentence and suppress the distracting ones. This basic idea [20] originates from the fact that a video is temporally continuous and highly correlated: adjacent frames at the boundary positions of segments possess diverse visual appearances, whereas adjacent frames inside each segment tend to look similar. We develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries, which disentangles the spatial and temporal semantics from the query sentence to guide the excavation of visual grounding clues in the corresponding dimensions.
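Following that intuition, a minimal sketch of query-guided local difference scoring is shown below; the gating form and layer names are hypothetical and only meant to illustrate how adjacent-frame differences could be conditioned on the sentence embedding to propose coarse boundaries.

```python
import torch
import torch.nn as nn

class LocalDifferenceBoundaryScorer(nn.Module):
    """Illustrative query-guided local difference context modeling (hypothetical design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.diff_proj = nn.Linear(dim, dim)   # transform adjacent-frame differences
        self.query_gate = nn.Linear(dim, dim)  # query-conditioned channel gate
        self.scorer = nn.Linear(dim, 1)        # boundary score per adjacent pair

    def forward(self, frames: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame features; query: (B, D) sentence embedding
        diff = frames[:, 1:] - frames[:, :-1]                      # (B, T-1, D) local differences
        gate = torch.sigmoid(self.query_gate(query)).unsqueeze(1)  # (B, 1, D) query-relevant channels
        ctx = torch.relu(self.diff_proj(diff)) * gate              # suppress query-irrelevant changes
        scores = self.scorer(ctx).squeeze(-1)                      # (B, T-1) raw boundary scores
        return torch.sigmoid(scores)                               # probability of a boundary between frames t and t+1
```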

TEMPORAL SENTENCE GROUNDING
GRAPH-BASED REASONING
PROBLEM FORMULATION
SPARSE SYNTACTIC GRAPH CONSTRUCTION
COARSE-GRAINED CRUCIAL FRAME SELECTION
FINE-GRAINED SPATIAL-TEMPORAL RELATIONSHIP MATCHING MODULE
LOSS FUNCTION FOR MODEL TRAINING
EVALUATION METRIC
PERFORMANCE COMPARISONS
