Video moment retrieval localizes the moment in a video that a sentence query describes. Recent approaches have made remarkable progress by relying on large-scale video-sentence annotations, but these annotations require extensive human labor and expertise, motivating unsupervised methods. Generating pseudo-supervision directly from videos is an effective strategy. Leveraging large-scale pre-trained models, we introduce external knowledge into pseudo-supervision construction. The main technical challenges are improving the diversity of the pseudo-supervision and alleviating the noise introduced by external knowledge. To address them, we propose two Knowledge-based Pseudo Supervision Construction (KPSC) strategies, KPSC-P and KPSC-F. Both follow two steps: generating diverse samples and alleviating knowledge chaos. The main difference is that the former first learns a representation space with prompt tuning, whereas the latter directly exploits data information. KPSC-P consists of two modules: 1) a Proposal Prompt (PP) that generates temporal proposals, and 2) a Verb Prompt (VP) that generates pseudo-queries with noun-verb patterns. KPSC-F likewise consists of two modules: 1) a Captioner that generates candidate queries, and 2) a Filter that alleviates knowledge chaos. Our KPSC thus comprises two complementary attempts to extract knowledge from pre-trained models. Extensive experiments show that both strategies outperform existing unsupervised methods on two public datasets (Charades-STA and ActivityNet-Captions) and perform on par with several methods that use stronger supervision.
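To make the generate-then-filter pattern shared by both strategies concrete, the sketch below illustrates the KPSC-F flow only at the level described in the abstract: a pre-trained captioner proposes candidate pseudo-queries for each temporal proposal, and a filter discards noisy candidates before pairing. All names (generate_captions, score, threshold) are hypothetical placeholders; the paper's actual models and scoring are not specified here.

```python
# Minimal sketch of the two-step pseudo-supervision construction (KPSC-F style):
# Step 1 generates diverse candidate queries from a pre-trained captioner;
# Step 2 alleviates "knowledge chaos" by filtering poorly aligned candidates.
# generate_captions, score, and threshold are assumed, not from the paper.

from typing import Callable, List, Tuple

Clip = Tuple[float, float]  # a temporal proposal: (start_sec, end_sec)

def build_pseudo_supervision(
    clips: List[Clip],
    generate_captions: Callable[[Clip], List[str]],  # pre-trained captioner
    score: Callable[[Clip, str], float],             # clip-query agreement score
    threshold: float = 0.5,                          # filtering cutoff (assumed)
) -> List[Tuple[Clip, str]]:
    """Pair each temporal proposal with its best surviving pseudo-query."""
    pseudo_pairs: List[Tuple[Clip, str]] = []
    for clip in clips:
        # Step 1: generate diverse candidate queries from external knowledge.
        candidates = generate_captions(clip)
        # Step 2: keep only candidates that agree well with the clip.
        kept = [(q, score(clip, q)) for q in candidates
                if score(clip, q) >= threshold]
        if kept:
            best_query = max(kept, key=lambda qs: qs[1])[0]
            pseudo_pairs.append((clip, best_query))
    return pseudo_pairs
```

The resulting (clip, query) pairs would then stand in for human video-sentence annotations when training a moment retrieval model; KPSC-P follows the same two steps but generates its proposals and queries via prompt tuning in a learned representation space.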