Abstract

Video corpus moment retrieval aims to locate video segments that semantically correspond to natural language queries within a large video collection. Effective representation learning across the video, subtitle, and language modalities is crucial for success. However, existing methods combine visual and subtitle features prematurely, introducing redundant information that hinders matching against the language query. Moreover, these methods lack query guidance during visual encoding and therefore overlook fine-grained semantics. To address these limitations, we propose a simple yet effective approach, the Semantic-guided Late Fusion Transformer (SgLFT), for video corpus moment retrieval. Our method leverages a Semantic-guided Locality-aware Transformer to capture fine-grained visual embeddings under the guidance of distinguishable query semantics, regardless of whether subtitles are available. We further introduce a late fusion strategy that merges subtitle and visual features through a Cross-modality Global-aware Context Fusion unit, enriching global contextual information. Finally, a Query-aware Feature Learning module aligns the query and video into a unified representation for localization. Our framework effectively models fine-grained semantic interactions between modalities under query guidance, advancing cross-modal representation learning. Extensive experiments on the TVR and DiDeMo benchmarks demonstrate that SgLFT significantly outperforms previous state-of-the-art methods.

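The abstract gives no implementation details, but the pipeline it describes (query-guided visual encoding, late fusion of the subtitle and visual streams, then query-aware alignment) can be illustrated with a minimal PyTorch sketch. Every choice below, including the use of standard transformer and multi-head attention layers, the feature dimension, and the pooling of the query, is an assumption made for illustration only; the modules merely stand in for the paper's Semantic-guided Locality-aware Transformer, Cross-modality Global-aware Context Fusion unit, and Query-aware Feature Learning module and should not be read as the authors' actual architecture.

```python
import torch
import torch.nn as nn


class LateFusionSketch(nn.Module):
    """Illustrative sketch of the late-fusion ordering described in the abstract.

    All module names, dimensions, and the attention-based fusion are assumptions
    for illustration; this is not the authors' implementation.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Query-conditioned visual encoding (stand-in for the
        # Semantic-guided Locality-aware Transformer).
        self.visual_enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.query_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Late fusion of subtitle context into the visual stream (stand-in for
        # the Cross-modality Global-aware Context Fusion unit).
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Query-aware alignment head (stand-in for Query-aware Feature Learning).
        self.align = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, sub: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # vis: (B, Tv, D) frame features; sub: (B, Ts, D) subtitle features;
        # query: (B, Tq, D) language-query token features.
        vis = self.visual_enc(vis)
        # Inject query semantics into the visual stream before it touches the
        # subtitle stream (query-guided visual encoding).
        vis, _ = self.query_to_vis(vis, query, query)
        # Fuse subtitle context only at this late stage (late fusion).
        fused, _ = self.fusion(vis, sub, sub)
        # Align the fused video representation with a pooled query embedding.
        pooled_q = query.mean(dim=1, keepdim=True).expand_as(fused)
        return self.align(torch.cat([fused, pooled_q], dim=-1))  # (B, Tv, D)


if __name__ == "__main__":
    B, Tv, Ts, Tq, D = 2, 32, 16, 8, 256
    model = LateFusionSketch(dim=D)
    out = model(torch.randn(B, Tv, D), torch.randn(B, Ts, D), torch.randn(B, Tq, D))
    print(out.shape)  # torch.Size([2, 32, 256])
```

The point of the sketch is the ordering of operations rather than the specific layers: query semantics condition the visual features before any contact with subtitles, and the subtitle stream is merged only at the final fusion step, mirroring the late-fusion argument made in the abstract.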