Abstract

We address video-text retrieval, which searches for videos via a natural-language description or vice versa. Most state-of-the-art methods consider cross-modal learning only for two or three data points in isolation, failing to benefit from the structural information among other data points from a global view. In this paper, we propose to exploit the comprehensive relationships among cross-modal samples via Graph Neural Networks (GNNs). To improve the discriminative ability to accurately identify the positive sample, we construct a Coarse-to-Fine GNN that progressively refines the retrieval results via multi-step reasoning. Specifically, we first adopt heuristic edge features to represent relationships. Then we design a scoring module in each layer that ranks the edges connected to the query node and drops the edges with lower scores. Finally, to alleviate the class-imbalance issue, we propose a random-drop focal loss to optimize the whole framework. Extensive experimental results show that our method consistently outperforms state-of-the-art methods on four benchmarks.
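To make the per-layer edge scoring and pruning step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the MLP scorer, the module name `EdgeScoringLayer`, and the `keep_ratio` hyperparameter are assumptions for illustration.

```python
# A hedged sketch of the per-layer edge scoring and pruning step described
# in the abstract. The scorer architecture and keep ratio are assumptions.
import torch
import torch.nn as nn


class EdgeScoringLayer(nn.Module):
    """Scores edges incident to the query node and keeps only the top-k."""

    def __init__(self, edge_dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(edge_dim, edge_dim), nn.ReLU(), nn.Linear(edge_dim, 1)
        )
        self.keep_ratio = keep_ratio

    def forward(self, edge_feats: torch.Tensor):
        # edge_feats: (num_edges, edge_dim) features of edges connected
        # to the query node.
        scores = self.scorer(edge_feats).squeeze(-1)           # (num_edges,)
        k = max(1, int(self.keep_ratio * edge_feats.size(0)))  # edges to keep
        keep_idx = scores.topk(k).indices                      # top-scoring edges
        return edge_feats[keep_idx], keep_idx                  # pruned edge set
```

Stacking such layers yields the coarse-to-fine behaviour the abstract describes: each step discards low-scoring candidates, so later layers reason over a smaller, cleaner candidate graph.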

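Similarly, a hedged sketch of a "random-drop" variant of focal loss is shown below, assuming (as the abstract suggests) that the abundant easy negative pairs are randomly subsampled while a standard focal term re-weights the remainder; the drop probability and `gamma` are illustrative values, not ones from the paper.

```python
# A minimal sketch of a random-drop focal loss for binary match labels
# (1 = positive video-text pair). Hyperparameter values are assumptions.
import torch
import torch.nn.functional as F


def random_drop_focal_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           gamma: float = 2.0,
                           neg_drop_prob: float = 0.5) -> torch.Tensor:
    p = torch.sigmoid(logits)
    p_t = torch.where(targets.bool(), p, 1.0 - p)   # prob. of the true class
    focal = (1.0 - p_t) ** gamma * F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")

    # Randomly drop a fraction of the negatives to ease class imbalance;
    # positives are always kept.
    keep = targets.bool() | (torch.rand_like(p) > neg_drop_prob)
    return focal[keep].mean()
```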