Abstract

In the text-video retrieval task, the objective is to compute the similarity between a text and a video and to rank relevant candidates higher. Most existing methods consider text-video semantic alignment only at the global level. However, obtaining global semantics by mean-pooling and simply aligning text and video in the global view may introduce semantic bias. In addition, some methods rely on offline object detectors or sentence parsers to obtain entity-level information from text and video and achieve local alignment; inaccurate detection introduces errors, and such approaches prevent models from being trained end-to-end for retrieval. To overcome these limitations, we propose multi-grained and semantic-guided alignment for text-video retrieval, which achieves fine-grained alignment based on video frames and text words, local alignment based on semantic centers, and global alignment. Specifically, we exploit the summary semantics of the text and the video to guide the local alignment based on semantic centers, since we believe that the importance of each semantic center is determined by the summary semantics. We evaluate our approach on four benchmark datasets, MSRVTT, MSVD, ActivityNet Captions, and DiDeMo, achieving better performance than most existing methods.
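To make the three alignment granularities concrete, the following is a minimal, hypothetical sketch of how the similarity scores could be combined, assuming pre-extracted word and frame features, a set of paired semantic centers, and summary vectors obtained by mean-pooling; all function names (e.g., summary_guided_local_sim) and the specific aggregation choices are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of multi-grained similarity scoring for text-video retrieval.
# Assumes features are already extracted; exact aggregation in the paper may differ.
import torch
import torch.nn.functional as F

def global_sim(text_feat, video_feat):
    # Global alignment: cosine similarity of sentence- and video-level features.
    return F.cosine_similarity(text_feat, video_feat, dim=-1)

def fine_grained_sim(word_feats, frame_feats):
    # Fine-grained alignment: for each word take the best-matching frame,
    # then average over words (a common token-level aggregation).
    w = F.normalize(word_feats, dim=-1)          # (num_words, d)
    f = F.normalize(frame_feats, dim=-1)         # (num_frames, d)
    sim = w @ f.t()                              # (num_words, num_frames)
    return sim.max(dim=1).values.mean()

def summary_guided_local_sim(text_centers, video_centers, text_summary, video_summary):
    # Local alignment: per-center similarities weighted by how relevant each
    # center is to the summary semantics of the other modality.
    tc = F.normalize(text_centers, dim=-1)       # (K, d)
    vc = F.normalize(video_centers, dim=-1)      # (K, d)
    center_sim = (tc * vc).sum(dim=-1)           # (K,) similarity per paired center
    # Center importance estimated from the summary vectors (illustrative choice).
    t_w = F.softmax(tc @ F.normalize(video_summary, dim=-1), dim=0)
    v_w = F.softmax(vc @ F.normalize(text_summary, dim=-1), dim=0)
    weights = 0.5 * (t_w + v_w)
    return (weights * center_sim).sum()

# Toy usage with random features (d = 512, K = 8 semantic centers).
d, K = 512, 8
word_feats, frame_feats = torch.randn(12, d), torch.randn(32, d)
text_feat, video_feat = word_feats.mean(0), frame_feats.mean(0)
text_centers, video_centers = torch.randn(K, d), torch.randn(K, d)
score = (global_sim(text_feat, video_feat)
         + fine_grained_sim(word_feats, frame_feats)
         + summary_guided_local_sim(text_centers, video_centers, text_feat, video_feat))
print(score.item())
```

In practice the three scores would likely be weighted or averaged rather than summed directly; the sketch only illustrates that the summary semantics modulate how much each semantic center contributes to the local score.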
