Abstract

Temporal sentence grounding aims to locate the video moment most relevant to a given natural language query. Because labeling temporal bounding boxes is time-consuming, recent works have focused on weakly supervised temporal sentence grounding (WTSG), which uses only video-text pair annotations. Existing WTSG methods mainly adopt an anchor-based structure to generate moment candidates and train the network with a triplet loss between positive and negative samples, so their performance depends heavily on the preset anchors and the loss margin. In this paper, we propose a novel contrastive generative adversarial learning method with adaptive box generation for WTSG. Specifically, temporal proposals are adaptively generated by a transformer-based box generator in a completely anchor-free manner. In addition, a novel contrastive generative adversarial learning process is proposed for network optimization, which effectively encourages the separation of positive and negative samples without a preset margin value. Extensive experiments indicate that our method achieves state-of-the-art performance on both the Charades-STA and ActivityNet Captions datasets.
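
The contrast drawn above between a margin-based triplet loss and a margin-free objective, and between preset anchors and directly predicted boxes, can be illustrated with a small sketch. This is not the paper's loss or architecture: the abstract gives no formulas, so the code below substitutes a generic InfoNCE-style contrastive loss as a stand-in for "no preset margin", and a (center, width) parameterization as one common way to express an anchor-free temporal box. All function names, the temperature value, and the normalized [0, 1] time axis are assumptions made for illustration.

```python
import math


def triplet_loss(pos_score, neg_score, margin=0.2):
    """Hinge-style triplet loss: zero once pos beats neg by `margin`.

    The margin is a hand-set hyperparameter (assumed value here).
    """
    return max(0.0, margin - pos_score + neg_score)


def contrastive_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss (illustrative stand-in, not the paper's objective).

    Computes -log softmax of the positive score against all scores.
    There is no margin: the loss keeps decreasing as the positive
    score pulls further away from the negatives.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


def box_to_span(center, width):
    """Convert a predicted (center, width) pair to a (start, end) span.

    Both inputs are assumed to be normalized to the video length, so the
    span is clipped to [0, 1]. No preset anchor sizes are involved.
    """
    start = max(0.0, center - width / 2.0)
    end = min(1.0, center + width / 2.0)
    return start, end


if __name__ == "__main__":
    # Triplet loss saturates at zero once the margin is satisfied.
    print(triplet_loss(0.9, 0.1))
    # The contrastive loss still distinguishes how well separated the pair is.
    print(contrastive_loss(0.9, [0.1]), contrastive_loss(0.6, [0.1]))
    print(box_to_span(0.5, 0.4))
```

In this sketch the triplet loss stops providing a training signal as soon as the margin is met, whereas the softmax-based loss continues to reward larger separation, which is one way to read the abstract's motivation for dropping the preset margin.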
