Abstract

Temporal sentence grounding in videos aims to localize the target video segment that semantically corresponds to a given sentence. Unlike previous methods, which mainly focus on matching semantics between the sentence and different video segments, in this paper we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism that leverages the sentence semantics to modulate the temporal convolution operations, better correlating and composing sentence-relevant video contents over time. The proposed SCDM also behaves dynamically with respect to the diverse video contents, so as to establish a precise semantic alignment between sentence and video. By coupling SCDM with a hierarchical temporal convolutional architecture, video segments of various temporal scales are composed and localized. In addition, fine-grained clip-level actionness scores are predicted by the SCDM-coupled temporal convolution at the bottom layer of the overall architecture; these scores are further used to adjust the temporal boundaries of the localized segments, leading to more accurate grounding results. Experimental results on benchmark datasets demonstrate that the proposed model consistently improves temporal grounding accuracy, and further analyses illustrate the advantages of SCDM in stabilizing model training and associating relevant video contents for temporal sentence grounding. Our code for this paper is available at https://github.com/yytzsy/SCDM-TPAMI.
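The core idea stated above is that sentence semantics generate per-location scale and shift parameters that modulate the temporal feature maps, and that this modulation varies dynamically with the local video content. Below is a minimal NumPy sketch of that mechanism; the shapes, the projection matrices (W_a, W_g, W_b), the tanh activations, and the normalization over time are our assumptions for illustration, not the authors' exact design, which is given in the paper and the linked repository.

```python
# Illustrative sketch of semantic conditioned dynamic modulation (SCDM).
# All names and layer choices here are assumptions for exposition; see
# https://github.com/yytzsy/SCDM-TPAMI for the official implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scdm(video_feats, word_feats, W_a, W_g, W_b):
    """Modulate per-location video features with sentence semantics.

    video_feats: (T, d) temporal feature map from a convolution layer.
    word_feats:  (N, d) encoded word features of the sentence.
    W_a, W_g, W_b: (d, d) learned projections (hypothetical names).
    """
    # Dynamic step: each temporal location attends to the sentence words,
    # so the modulation differs across locations with diverse contents.
    scores = video_feats @ W_a @ word_feats.T      # (T, N)
    attn = softmax(scores, axis=-1)                # (T, N)
    c = attn @ word_feats                          # (T, d) per-location sentence context

    # Conditioning step: the attended sentence context generates
    # per-location scale (gamma) and shift (beta) parameters.
    gamma = np.tanh(c @ W_g)                       # (T, d)
    beta = np.tanh(c @ W_b)                        # (T, d)

    # Modulation step: normalize the feature map over time, then apply
    # the semantics-conditioned affine transform.
    mu = video_feats.mean(axis=0, keepdims=True)
    sigma = video_feats.std(axis=0, keepdims=True) + 1e-6
    return gamma * (video_feats - mu) / sigma + beta   # (T, d)
```

Applying such a modulation at every layer of a hierarchical temporal convolution stack is what, per the abstract, lets the network compose sentence-relevant contents at multiple temporal scales.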
