Abstract

Given an untrimmed video and a language query, Video Temporal Grounding (VTG) aims to locate the time interval in the video that is semantically relevant to the query. Existing fully-supervised VTG methods require accurate temporal boundary annotations, which are time-consuming and expensive to obtain. On the other hand, weakly-supervised VTG methods, where only paired videos and queries are available during training, lag far behind the fully-supervised ones. In this paper, we introduce point supervision to narrow the performance gap at an affordable annotation cost and propose a novel method dubbed Point-Supervised Video Temporal Grounding (PS-VTG). Specifically, an attention-based grounding network is first employed to obtain a language activation sequence (LAS). A pseudo segment-level label is then generated from the LAS and the given point supervision to assist the training process. In addition, multi-level distribution calibration and cross-modal contrast are designed to obtain discriminative feature representations and precisely highlight the language-relevant video segments. Experiments on three benchmarks demonstrate that our method trained with point supervision significantly outperforms weakly-supervised approaches and achieves performance comparable to fully-supervised ones.
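The abstract does not spell out how the pseudo segment-level label is derived from the LAS and the annotated point. As a rough illustration of one plausible scheme, the sketch below expands a segment outward from the annotated point while the per-segment activation scores stay above a threshold. The function name `pseudo_segment_from_point`, the threshold value, and the expansion rule are all assumptions made for illustration and are not the authors' actual procedure.

```python
import numpy as np

def pseudo_segment_from_point(las, point_idx, threshold=0.5):
    """Derive a pseudo segment-level label around an annotated point.

    las: 1-D array of language activation scores, one per video segment.
    point_idx: index of the single point-supervised timestamp.
    threshold: hypothetical activation cutoff for segment membership.
    Returns (start, end) indices of the pseudo segment, inclusive.
    """
    start = end = point_idx
    # Grow the segment leftwards while activations stay above the threshold.
    while start > 0 and las[start - 1] >= threshold:
        start -= 1
    # Grow the segment rightwards likewise.
    while end < len(las) - 1 and las[end + 1] >= threshold:
        end += 1
    return start, end

# Toy example: an activation sequence peaking near the annotated point.
las = np.array([0.1, 0.2, 0.7, 0.9, 0.8, 0.6, 0.2, 0.1])
print(pseudo_segment_from_point(las, point_idx=3))  # -> (2, 5)
```

In the paper, such pseudo labels would additionally interact with the multi-level distribution calibration and cross-modal contrast objectives; this toy routine only illustrates the point-to-segment expansion idea.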
