Recently, weakly supervised temporal sentence grounding in videos (TSGV) has attracted extensive attention because it does not require precise start-end time annotations during training and can quickly retrieve segments of interest according to user needs. In weakly supervised TSGV, query reconstruction (QR)-based methods are the current mainstream, and the quality of proposals determines their performance. QR-based methods suffer from two problems in proposal quality. First, a multi-modal global token is usually mapped to proposals with limited duration diversity, making it difficult to capture relevant segments of varying durations in real scenarios. Second, Gaussian functions are typically used to generate relatively fixed weights for the frames within a proposal; these weights are applied to the original video features to produce proposal-specific features, so query-irrelevant frames degrade the discriminability of the proposal features. In this study, we propose a query-aware multi-scale proposal network (QMN). Initially, pre-trained encoders are used to extract video and query features. Subsequently, a multi-scale proposal generation module is designed to refine the video features under query guidance and diversify proposal durations; this module performs multi-modal interaction and multi-scale modeling to obtain proposals of different durations. Furthermore, to extract discriminative proposal features and better model the correlations among proposal frames, a query-aware weight generator is constructed to learn frame weights through contrastive learning, suppressing query-irrelevant frame representations. Finally, the masked query is reconstructed from the proposal features to select the best proposal. The effectiveness of the proposed QMN is verified through experiments on the Charades-STA and ActivityNet-Captions datasets.
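The fixed Gaussian weighting that the abstract critiques can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the normalized-time parameterization, and the toy feature dimensions are all assumptions introduced here purely to show how Gaussian frame weights pool per-frame video features into a single proposal feature, regardless of each frame's relevance to the query.

```python
import numpy as np

def gaussian_frame_weights(num_frames: int, center: float, width: float) -> np.ndarray:
    """Fixed Gaussian weights over frame indices normalized to [0, 1].

    `center` and `width` stand for a proposal's normalized midpoint and
    spread (hypothetical parameterization). Every frame near the center
    receives a high weight whether or not it is relevant to the query,
    which is the limitation the abstract points out.
    """
    t = np.linspace(0.0, 1.0, num_frames)
    w = np.exp(-((t - center) ** 2) / (2.0 * width ** 2))
    return w / w.sum()  # normalize so the weights sum to 1

def proposal_feature(video_feats: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted pooling of per-frame features into one proposal feature."""
    return weights @ video_feats  # (T,) @ (T, D) -> (D,)

# Toy usage: 8 frames with 4-dim features, proposal centered at t = 0.5.
feats = np.ones((8, 4))
w = gaussian_frame_weights(8, center=0.5, width=0.1)
pooled = proposal_feature(feats, w)
```

A query-aware weight generator, by contrast, would replace the fixed `w` with weights predicted from both the frame features and the query, so that query-irrelevant frames can be down-weighted even when they sit near the proposal center.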