Abstract

Temporal sentence grounding in videos (TSGV) aims to retrieve the video segment from an untrimmed video that semantically matches a given query. Most previous methods learn either local or global query features and then perform cross-modal interaction, ignoring the complementarity between local and global features. In this paper, we propose a novel Multi-Level Interaction Network for temporal sentence grounding in videos. The network models query semantics at both the phrase and sentence levels: phrase-level features interact with video features to highlight the video segments relevant to each query phrase, while sentence-level features interact with video features to capture global localization information. We further design a stacked fusion gate module that captures the temporal relationships and semantic information among video segments. The module introduces a gating mechanism that lets the model adaptively regulate the degree to which video and query features are fused, further improving the accuracy of target-segment prediction. Extensive experiments on the ActivityNet Captions and Charades-STA benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods.
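To illustrate the kind of gating described above, the following is a minimal sketch, assuming a PyTorch implementation; the layer names, dimensions, and fusion form are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical sketch of one gated cross-modal fusion step:
    a sigmoid gate decides, per channel, how much fused query-video
    information replaces the original video feature."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # produces gate values in [0, 1]
        self.proj = nn.Linear(2 * dim, dim)  # projects concatenated features

    def forward(self, video_feat, query_feat):
        # video_feat: (batch, num_clips, dim); query_feat: (batch, dim)
        query_feat = query_feat.unsqueeze(1).expand_as(video_feat)
        joint = torch.cat([video_feat, query_feat], dim=-1)
        g = torch.sigmoid(self.gate(joint))            # fusion degree
        fused = g * torch.tanh(self.proj(joint)) + (1 - g) * video_feat
        return fused

# Usage with toy shapes (batch of 2, 16 clips, 256-dim features):
fusion = GatedFusion(dim=256)
out = fusion(torch.randn(2, 16, 256), torch.randn(2, 256))
```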
