Temporal sentence grounding is a challenging task that aims to localize the segment of an untrimmed video that semantically corresponds to a given natural-language query. Existing methods either use a cross-modal matching architecture that follows a scan-and-rank pipeline, or directly predict, for each frame, the probability of its being a target boundary based on the entire video content. However, because they ignore the query's semantic concepts and the local and global cross-modal context, such methods perform poorly when critical semantic concepts in the query are relevant to multiple video segments, or when the desired segment contains a query-irrelevant scene. In this paper, we propose a novel semantic-aware graph calibration network (SaGCN) to address these issues. Specifically, we first introduce a semantic-aware local relational graph module that captures the inherent relationships among the local contextual information relevant to each semantic concept, enabling fine-grained cross-modal interaction. Then, a semantic-aware global relational graph module integrates global contextual information and achieves cross-modal alignment. Finally, an attention-based calibration module, guided by the query description, eliminates the irrelevant information retained in the visual modality. Extensive experiments on two widely used datasets (Charades-STA and TACoS) verify the effectiveness of SaGCN, which achieves significant and consistent improvements over state-of-the-art approaches.