Abstract

Precise video moment retrieval is crucial for enabling users to locate specific moments within a large video corpus. This paper presents Interactive Moment Localization with Multimodal Fusion (IMF-MF), a novel model that leverages self-attention to achieve state-of-the-art performance. IMF-MF integrates query context with multimodal features, including visual and audio information, to accurately localize moments of interest. The model operates in two distinct phases: feature fusion and joint representation learning. The first phase dynamically calculates fusion weights to adapt the combination of multimodal video content, ensuring that the most relevant features are prioritized. The second phase employs bi-directional attention to tightly couple video and query features into a unified joint representation for moment localization. This joint representation captures long-range dependencies and complex patterns, enabling the model to effectively distinguish between relevant and irrelevant video segments. The effectiveness of IMF-MF is demonstrated through comprehensive evaluations on three benchmark datasets: TVR, a closed-world collection of TV episodes; Charades, an open-world collection of user-generated videos; and DiDeMo, an open-world, diverse video moment retrieval dataset. The empirical results show that the proposed approach consistently surpasses existing state-of-the-art methods in retrieval accuracy, as measured by Recall (R1, R5, R10, and R100) and Intersection-over-Union (IoU), highlighting the benefits of its interactive moment localization approach and its use of self-attention for feature representation and attention modeling.
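
The abstract does not give implementation details, but the two-phase design it describes can be illustrated with a minimal sketch. The snippet below is not the authors' code: it assumes transformer-style clip and token features, a learned gating layer for the dynamic fusion weights of phase one, and standard multi-head cross-attention in both directions for the phase-two joint representation. All module names, dimensions, and the gating scheme are illustrative assumptions.

```python
# Minimal sketch of the two-phase idea (assumed implementation, not the paper's code).
import torch
import torch.nn as nn


class DynamicFusion(nn.Module):
    """Phase 1: weight visual and audio features per clip before combining them."""

    def __init__(self, dim: int):
        super().__init__()
        # Scores how much each modality should contribute at each time step.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, num_clips, dim)
        weights = torch.softmax(self.gate(torch.cat([visual, audio], dim=-1)), dim=-1)
        return weights[..., 0:1] * visual + weights[..., 1:2] * audio


class BiDirectionalAttention(nn.Module):
    """Phase 2: couple video and query features with cross-attention in both directions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.video_to_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # Video clips attend to query tokens, and query tokens attend to clips;
        # the query-aware clip representation is what a localization head would score.
        video_ctx, _ = self.video_to_query(video, query, query)
        query_ctx, _ = self.query_to_video(query, video, video)
        # Summarize the query side and broadcast it over the clip axis.
        return video_ctx + query_ctx.mean(dim=1, keepdim=True)


if __name__ == "__main__":
    B, T, L, D = 2, 16, 10, 256  # batch, clips, query tokens, feature dim
    visual, audio = torch.randn(B, T, D), torch.randn(B, T, D)
    query = torch.randn(B, L, D)
    fused = DynamicFusion(D)(visual, audio)
    joint = BiDirectionalAttention(D)(fused, query)
    print(joint.shape)  # torch.Size([2, 16, 256])
```

In this sketch, moment localization would be performed by a downstream head that scores each clip position of the joint representation against candidate start/end boundaries; that head is omitted here because the abstract does not specify it.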
