Abstract
Video grounding aims to locate the moment of interest in a video that semantically corresponds to a given query. We argue that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies force the video and language modalities into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. In the video modality, however, actions, especially salient ones, typically span many frames and therefore carry far richer detail, whereas in the query each verb is confined to a single word. This discrepancy reveals a significant sparsity in language features, which makes naively mapping the two modalities into a shared space suboptimal. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory, yet previous methods fail to account for this perception process. To address these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module that explores latent expansion information for salient verbs in the query, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we design an auxiliary task of action temporal prediction for video grounding and introduce a temporal rank loss function that simulates the human perceptual system’s segmentation of events, making AGPT temporally aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
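The abstract does not specify the form of the temporal rank loss. As a rough illustration of how a rank loss can encode temporal order in general, the sketch below uses a generic pairwise margin (hinge) formulation. This is an assumption, not the AGPT loss: the function name `temporal_rank_loss`, the `margin` parameter, and the idea of one scalar temporal score per action are all hypothetical.

```python
from typing import Sequence


def temporal_rank_loss(scores: Sequence[float], margin: float = 0.1) -> float:
    """Generic pairwise hinge loss over temporally ordered actions.

    Hypothetical setup (not taken from the paper): scores[i] is a predicted
    temporal position for the i-th action, and the actions are listed in
    ground-truth chronological order. A pair (i, j) with i < j is penalized
    when scores[j] does not exceed scores[i] by at least `margin`.
    """
    n = len(scores)
    if n < 2:
        return 0.0  # no pairs to rank

    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Action i precedes action j, so we want scores[j] - scores[i] >= margin.
            total += max(0.0, margin - (scores[j] - scores[i]))
            pairs += 1
    return total / pairs  # mean over all ordered pairs


# Predictions that respect the chronological order incur no penalty,
# while reversed predictions are penalized.
ordered = temporal_rank_loss([0.1, 0.5, 0.9])
reversed_order = temporal_rank_loss([0.9, 0.5, 0.1])
```

In a real model such a term would typically be computed on tensors and added to the grounding objective with a weighting coefficient; how AGPT does this is not stated in the abstract.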