Abstract

The video moment retrieval task aims to retrieve the moment in an untrimmed video that best matches the semantics of a sentence query. Existing methods mainly rely on two separate modules: one learns intra-modal relations to understand the video and query contents, and the other explores inter-modal interactions to build a semantic bridge between video and language. However, intra-modal relation information is easily overlooked when capturing inter-modal interactions. In fact, intra-modal relations and inter-modal interactions can be learned simultaneously within a unified module, so that the video and the sentence guide each other. Towards this end, we propose a Cross-Modal Interaction Network (CMIN) for video moment retrieval that jointly explores the intra-modal relations and inter-modal interactions between video frames and query words. In CMIN, a query-guided channel attention module is designed to suppress query-irrelevant visual features and enhance crucial content; a cross-attention module then simultaneously considers intra-modal relations within each modality and fine-grained inter-modal interactions between frames and words, enhancing the semantic relevance between the video and the sentence query. Experiments on two public datasets (Charades-STA and TACoS) demonstrate the superiority of our method over state-of-the-art approaches.
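
The abstract gives only a high-level view of the two modules. Below is a minimal PyTorch sketch of how a query-guided channel attention step and a joint intra-/inter-modal cross-attention step could be realized; all module names, dimensions, and the specific choice of a single self-attention over the concatenated frame-and-word sequence are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class QueryGuidedChannelAttention(nn.Module):
    """Reweights the channels of frame features with a sentence-level query vector,
    suppressing query-irrelevant channels and enhancing crucial ones (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, frames, query_vec):
        # frames: (B, T, D) frame features; query_vec: (B, D) pooled query embedding
        gates = torch.sigmoid(self.gate(query_vec)).unsqueeze(1)  # (B, 1, D) channel gates
        return frames * gates                                     # per-channel reweighting


class CrossAttention(nn.Module):
    """One self-attention over the concatenated frame+word sequence, so intra-modal
    relations and inter-modal interactions are captured in a single module (assumed design)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames, words):
        # frames: (B, T, D), words: (B, L, D)
        x = torch.cat([frames, words], dim=1)   # joint sequence of frames and words
        out, _ = self.attn(x, x, x)             # attention within and across modalities
        t = frames.size(1)
        return out[:, :t], out[:, t:]           # split back into enhanced frame / word features


if __name__ == "__main__":
    B, T, L, D = 2, 32, 12, 256
    frames = torch.randn(B, T, D)
    words = torch.randn(B, L, D)
    query_vec = words.mean(dim=1)               # simple pooled sentence representation
    frames = QueryGuidedChannelAttention(D)(frames, query_vec)
    frames, words = CrossAttention(D)(frames, words)
    print(frames.shape, words.shape)            # (2, 32, 256) (2, 12, 256)
```

In this sketch the concatenation-then-attention step is what lets frame-frame, word-word, and frame-word dependencies be modeled in one pass, matching the abstract's claim that intra-modal relations and inter-modal interactions are learned within a unified module.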
