Abstract

The problem of video-text retrieval, which searches for videos via natural language descriptions or vice versa, has attracted growing attention due to the explosive scale of videos produced every day. The dominant approaches to this problem follow a pipeline that first learns compact feature representations of videos and texts, and then jointly embeds them into a common feature space where matched video-text pairs are close and unmatched pairs are far apart. However, most of them neither consider the structural similarities among cross-modal samples from a global view, nor leverage useful information from other relevant retrieval processes. We argue that both kinds of information have great potential for video-text retrieval. In this paper, we propose to extract useful knowledge from the retrieval process by exploiting structural similarities via Graph Neural Networks (GNNs), and then progressively transfer this knowledge from relevant retrieval processes in a general-to-specific manner to assist the current retrieval process. Specifically, for the retrieval of the current query, we first construct a sequence of query-graphs whose central queries range from distant to close to the current query. Then we conduct knowledge-guided message passing in each query-graph to exploit regional structural similarities, and gather knowledge of different levels from the updated query-graphs with a knowledge-based attention mechanism. Finally, we transfer the extracted knowledge from general to specific to assist the current retrieval process. Extensive experimental results show that our model outperforms state-of-the-art methods on four benchmarks.
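
As a rough illustration of the two components named in the abstract, the sketch below shows one round of message passing over a query-graph followed by knowledge-based attention over the summaries of the updated query-graphs. It is a minimal, hypothetical reading of the described pipeline, not the authors' implementation; all names (QueryGraphLayer, KnowledgeAttention, adj, graph_summaries) and the specific layer choices are illustrative assumptions.

```python
# Hypothetical sketch of the abstract's two components (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryGraphLayer(nn.Module):
    """One round of message passing that updates the node features of a query-graph."""
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(dim, dim)   # transforms neighbor features into messages
        self.update = nn.GRUCell(dim, dim)   # fuses aggregated messages with node states

    def forward(self, x, adj):
        # x: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) row-normalized adjacency
        msgs = adj @ self.message(x)         # aggregate messages from graph neighbors
        return self.update(msgs, x)          # update each node with its incoming messages

class KnowledgeAttention(nn.Module):
    """Knowledge-based attention over summaries of the updated query-graphs."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, query, graph_summaries):
        # query: (dim,) current query embedding
        # graph_summaries: (num_graphs, dim), ordered from general (distant) to specific (close)
        weights = F.softmax(self.score(graph_summaries * query).squeeze(-1), dim=0)
        return weights @ graph_summaries     # attention-weighted knowledge for the current retrieval
```

Under this reading, the fused vector returned by KnowledgeAttention would then be combined with the current query embedding before scoring candidates in the joint video-text space; how that combination is done is left unspecified here.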
