Abstract

Given an untrimmed video and a language query, Video Temporal Grounding (VTG) aims to locate the time interval in the video that is semantically relevant to the query. Existing fully-supervised VTG methods require accurate temporal boundary annotations, which are time-consuming and expensive to obtain. On the other hand, weakly-supervised VTG methods, where only paired videos and queries are available during training, lag far behind the fully-supervised ones. In this paper, we introduce point supervision to narrow the performance gap at an affordable annotation cost and propose a novel method dubbed Point-Supervised Video Temporal Grounding (PS-VTG). Specifically, an attention-based grounding network is first employed to obtain a language activation sequence (LAS). A pseudo segment-level label is then generated from the LAS and the given point supervision to assist the training process. In addition, multi-level distribution calibration and cross-modal contrast are designed to obtain discriminative feature representations and precisely highlight the language-relevant video segments. Experiments on three benchmarks demonstrate that our method trained with point supervision significantly outperforms weakly-supervised approaches and achieves performance comparable to fully-supervised ones.
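The abstract does not spell out how the pseudo segment-level label is derived from the LAS and the annotated point. As a rough illustration of one plausible scheme, the sketch below expands a segment outward from the annotated point while the per-segment activation scores stay above a threshold. The function name `pseudo_segment_from_point`, the threshold value, and the expansion rule are all assumptions made for illustration and are not the authors' actual procedure.

```python
import numpy as np

def pseudo_segment_from_point(las, point_idx, threshold=0.5):
    """Derive a pseudo segment-level label around an annotated point.

    las: 1-D array of language activation scores, one per video segment.
    point_idx: index of the single point-supervised timestamp.
    threshold: hypothetical activation cutoff for segment membership.
    Returns (start, end) indices of the pseudo segment, inclusive.
    """
    start = end = point_idx
    # Grow the segment leftwards while activations stay above the threshold.
    while start > 0 and las[start - 1] >= threshold:
        start -= 1
    # Grow the segment rightwards likewise.
    while end < len(las) - 1 and las[end + 1] >= threshold:
        end += 1
    return start, end

# Toy example: an activation sequence peaking near the annotated point.
las = np.array([0.1, 0.2, 0.7, 0.9, 0.8, 0.6, 0.2, 0.1])
print(pseudo_segment_from_point(las, point_idx=3))  # -> (2, 5)
```

In the paper, such pseudo labels would additionally interact with the multi-level distribution calibration and cross-modal contrast objectives; this toy routine only illustrates the point-to-segment expansion idea.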
