Abstract

Video corpus moment retrieval aims to locate video segments that semantically correspond to natural language queries within a large video collection. Effective representation learning across the video, subtitle, and language modalities is crucial for success. However, existing methods combine visual and subtitle features prematurely, introducing redundant information that hinders matching against the language query. Moreover, these methods lack query guidance during visual encoding and therefore overlook fine-grained semantics. To address these limitations, we propose a simple yet effective approach, the Semantic-guided Late Fusion Transformer (SgLFT), for video corpus moment retrieval. Our method leverages a Semantic-guided Locality-aware Transformer to capture fine-grained visual embeddings under the guidance of distinguishable query semantics, regardless of whether subtitles are available. We further introduce a late fusion strategy that merges subtitle and visual features through a Cross-modality Global-aware Context Fusion unit, enriching global contextual information. Finally, a Query-aware Feature Learning module aligns the query and video into a unified representation for localization. Our framework effectively models fine-grained semantic interactions between modalities under query guidance, advancing cross-modal representation learning. Extensive experiments on the TVR and DiDeMo benchmarks demonstrate that SgLFT significantly outperforms previous state-of-the-art methods.

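The abstract gives no implementation details, but the pipeline it describes (query-guided visual encoding, late fusion of the subtitle and visual streams, then query-aware alignment) can be illustrated with a minimal PyTorch sketch. Every choice below, including the use of standard transformer and multi-head attention layers, the feature dimension, and the pooling of the query, is an assumption made for illustration only; the modules merely stand in for the paper's Semantic-guided Locality-aware Transformer, Cross-modality Global-aware Context Fusion unit, and Query-aware Feature Learning module and should not be read as the authors' actual architecture.

```python
import torch
import torch.nn as nn


class LateFusionSketch(nn.Module):
    """Illustrative sketch of the late-fusion ordering described in the abstract.

    All module names, dimensions, and the attention-based fusion are assumptions
    for illustration; this is not the authors' implementation.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Query-conditioned visual encoding (stand-in for the
        # Semantic-guided Locality-aware Transformer).
        self.visual_enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.query_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Late fusion of subtitle context into the visual stream (stand-in for
        # the Cross-modality Global-aware Context Fusion unit).
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Query-aware alignment head (stand-in for Query-aware Feature Learning).
        self.align = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, sub: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # vis: (B, Tv, D) frame features; sub: (B, Ts, D) subtitle features;
        # query: (B, Tq, D) language-query token features.
        vis = self.visual_enc(vis)
        # Inject query semantics into the visual stream before it touches the
        # subtitle stream (query-guided visual encoding).
        vis, _ = self.query_to_vis(vis, query, query)
        # Fuse subtitle context only at this late stage (late fusion).
        fused, _ = self.fusion(vis, sub, sub)
        # Align the fused video representation with a pooled query embedding.
        pooled_q = query.mean(dim=1, keepdim=True).expand_as(fused)
        return self.align(torch.cat([fused, pooled_q], dim=-1))  # (B, Tv, D)


if __name__ == "__main__":
    B, Tv, Ts, Tq, D = 2, 32, 16, 8, 256
    model = LateFusionSketch(dim=D)
    out = model(torch.randn(B, Tv, D), torch.randn(B, Ts, D), torch.randn(B, Tq, D))
    print(out.shape)  # torch.Size([2, 32, 256])
```

The point of the sketch is the ordering of operations rather than the specific layers: query semantics condition the visual features before any contact with subtitles, and the subtitle stream is merged only at the final fusion step, mirroring the late-fusion argument made in the abstract.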