Abstract

This paper strives for temporal localization of actions in untrimmed videos via natural language queries. Prevailing methods represent both the query sentence and the video as a whole and perform sentence-video matching via global features, which neglects the local correspondence between sentence and video. In this work, we move beyond this limitation by delving into fine-grained local sentence-video matching, such as phrase-motion matching and word-object matching. We propose a hierarchical matching and reasoning method based on a deep conditional random field that integrates hierarchical matching between visual concepts and textual semantics for temporal action localization via a query sentence. Our method decomposes each sentence into textual semantics (i.e., phrases and words), obtains multi-level matching results between these textual semantics and the visual concepts in a video (i.e., phrase-motion and word-object matching results), and then reasons about the relations among the multi-level matching results via the pairwise potentials of the conditional random field to achieve coherence across the matching hierarchy. By minimizing the overall potential, the final matching score between a sentence and a video is computed as the conditional probability of the conditional random field. Our proposed method is evaluated on the public Charades-STA dataset, and the experimental results verify its superiority over state-of-the-art methods.
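The abstract does not spell out the potential functions, so the sketch below is only an illustrative assumption of how per-level matching scores (sentence-video, phrase-motion, word-object) could be fused by a small chain-structured CRF: unary potentials come from each level's score, pairwise potentials penalize disagreement between adjacent levels, and the final score is the conditional probability of the fully consistent "match" assignment. All function names, the potential forms, and the example scores are hypothetical.

```python
import numpy as np

def crf_matching_score(level_scores, pairwise_weight=1.0):
    """Hypothetical sketch: fuse per-level matching scores with a chain CRF.

    level_scores: matching scores in [0, 1] for the sentence-video,
    phrase-motion, and word-object levels. The exact potential forms are
    illustrative assumptions, not the paper's specification.
    """
    eps = 1e-8
    # Binary state per level: 0 = no match, 1 = match.
    # Unary potentials: negative-log terms derived from each level's score.
    unary = np.array([[-np.log(1 - s + eps), -np.log(s + eps)]
                      for s in level_scores])          # shape (L, 2)

    # Pairwise potential: penalize disagreement between adjacent levels,
    # encouraging coherence across the matching hierarchy.
    pairwise = pairwise_weight * (1 - np.eye(2))       # shape (2, 2)

    # Enumerate all joint assignments (2^L states) and form the Gibbs
    # distribution P(y | x) proportional to exp(-E(y)).
    L = len(level_scores)
    energies = {}
    for assignment in np.ndindex(*([2] * L)):
        e = sum(unary[i, y] for i, y in enumerate(assignment))
        e += sum(pairwise[assignment[i], assignment[i + 1]]
                 for i in range(L - 1))
        energies[assignment] = e

    z = sum(np.exp(-e) for e in energies.values())     # partition function
    # Final score: probability that every level agrees on "match".
    return float(np.exp(-energies[(1,) * L]) / z)

# Example with assumed scores: strong sentence- and phrase-level evidence,
# weaker word-level evidence.
print(crf_matching_score([0.9, 0.8, 0.6]))
```

In this toy setting the pairwise term pulls the fused score down when the levels disagree, which mirrors the coherence constraint the abstract describes.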
