Few-Shot Temporal Sentence Grounding via Memory-Guided Semantic Learning

Daizong Liu, Pan Zhou, Haozhao Wang, Ruixuan Li, Zichuan Xu

doi:10.1109/tcsvt.2022.3223725

Abstract

Temporal sentence grounding (TSG) is an important yet challenging task in video-based information retrieval. Given an untrimmed video, it requires the machine to predict the video segment of interest that is semantically related to a given sentence query. Most existing TSG methods train well-designed deep networks on large amounts of data to align the semantics of video-query pairs for activity grounding. However, we argue that these works easily capture the selection biases of video-query pairs in a dataset rather than showing the robust reasoning abilities needed to handle rarely occurring pairs (i.e., few-shot contents). To alleviate this limitation of the imbalanced data distribution during network training, in this paper we propose a novel memory-augmented network, called the Memory-Guided Semantic Learning Network (MGSL-Net), that handles the few-shot TSG task and enhances the generalization ability of the model. Specifically, given a matched video-query input, we first employ a graph attentive cross-modal interaction module to align their semantics in a cycle-consistent manner. Then, we develop memory modules in both the video and query domains to record the cross-modal shared semantic features in domain-specific persistent memory. Finally, a heterogeneous attention module integrates the memory-enhanced multi-modal features in both the video and query domains with further feature calibration. During training, the memory modules are dynamically associated with both common and rare cases to memorize all observed contents, alleviating the issue of forgetting the few-shot contents. Therefore, in testing, the rare cases can be enhanced by retrieving the stored memories, improving the generalization ability of the model. Experimental results on three benchmarks (ActivityNet Caption, Charades-STA and TACoS) show the superiority of our method in both effectiveness and efficiency.
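
The idea of recording features in a persistent memory during training and retrieving them at test time can be illustrated with a generic attention-addressed memory bank. The sketch below is not the MGSL-Net implementation: the class name `MemoryBank`, the cosine-plus-softmax addressing, the residual read, and the momentum-based write are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MemoryBank:
    """Hypothetical persistent memory for one domain (video or query)."""

    def __init__(self, slots, momentum=0.9):
        self.slots = [list(s) for s in slots]  # one vector per memory slot
        self.momentum = momentum               # assumed update rate

    def address(self, feature):
        """Attention weights of a feature over all memory slots."""
        return softmax([cosine(feature, s) for s in self.slots])

    def read(self, feature):
        """Retrieve a weighted sum of slots and add it to the input feature."""
        w = self.address(feature)
        retrieved = [
            sum(w[i] * self.slots[i][d] for i in range(len(self.slots)))
            for d in range(len(feature))
        ]
        return [f + r for f, r in zip(feature, retrieved)]

    def write(self, feature):
        """Move the best-matching slot toward the feature; return its index."""
        w = self.address(feature)
        k = max(range(len(w)), key=w.__getitem__)
        m = self.momentum
        self.slots[k] = [m * s + (1 - m) * f for s, f in zip(self.slots[k], feature)]
        return k
```

In such a scheme, `write` would be called on features seen during training (common and rare alike), and `read` would be called at test time so that a rarely seen input is enhanced by whatever was stored for similar inputs.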
