Abstract

Temporal language localization in videos aims to retrieve, from an untrimmed video, the moment that best matches a text query. Existing methods based on graph convolutional networks are effective for feature representation and cross-modal interaction, but they do not impose a sparsity constraint when constructing the graph, which increases computational cost and introduces redundant connections that can degrade localization accuracy. We therefore propose a novel sparse graph matching network for temporal language localization in videos. Specifically, we use graph convolutional networks to learn video features and dynamically construct a video graph under sparsity and connectivity constraints, and we model the semantic features of the text by exploiting the complementarity between its sequential context and its syntactic structure. For cross-modal interaction, we design a sparse graph matching method based on an affinity matrix that matches the video and text graphs and aligns their cross-modal semantic features. Finally, we fuse the features of the two modalities to generate candidate moments and compute their confidence scores to locate the moment matching the query. Experimental results on the public TACoS, ActivityNet Captions, and QVHighlights datasets demonstrate the superiority of our method over state-of-the-art methods.
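To make the affinity-based sparse matching idea concrete, the following is a minimal sketch (not the paper's actual implementation) of how video graph nodes could be aligned with text graph nodes through an affinity matrix under a top-k sparsity constraint. The function name `sparse_cross_modal_matching`, the tensor shapes, and the `top_k` parameter are illustrative assumptions, not details given in the abstract.

```python
import torch
import torch.nn.functional as F

def sparse_cross_modal_matching(video_nodes, text_nodes, top_k=4):
    """Illustrative affinity-based matching between video and text graph nodes.

    video_nodes: (Nv, d) tensor of video graph node features
    text_nodes:  (Nt, d) tensor of text graph node features
    top_k:       number of text nodes each video node attends to (sparsity)
    Returns video node features enriched with aligned text semantics.
    """
    # Affinity matrix between every video node and every text node.
    affinity = video_nodes @ text_nodes.t() / video_nodes.size(-1) ** 0.5  # (Nv, Nt)

    # Sparsity constraint: keep only the top-k strongest text connections
    # per video node and mask out the rest before normalization.
    k = min(top_k, text_nodes.size(0))
    topk_vals, topk_idx = affinity.topk(k=k, dim=-1)
    masked = torch.full_like(affinity, float('-inf'))
    masked.scatter_(-1, topk_idx, topk_vals)
    weights = F.softmax(masked, dim=-1)  # (Nv, Nt), zero outside the top-k entries

    # Aggregate text semantics into each video node (cross-modal alignment),
    # then fuse with the original video features via a residual connection.
    aligned_text = weights @ text_nodes  # (Nv, d)
    return video_nodes + aligned_text
```

In this sketch the sparsification happens by masking the affinity matrix before the softmax, so each video node only receives information from its strongest text connections; the paper's actual sparsity and connectivity constraints, graph construction, and fusion for candidate-moment scoring may differ.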
