Cross-modal retrieval aims to retrieve semantically relevant information across modalities. Existing cross-modal retrieval methods mainly explore the semantic consistency between image and text while rarely considering the rankings of positive instances in the retrieval results. Moreover, these methods seldom take into account the cross-interaction between image and text, which limits their ability to learn semantic relations. In this paper, we propose a Unified framework with Ranking Learning (URL) for cross-modal retrieval. The unified framework consists of three sub-networks: a visual network, a textual network, and an interaction network. The visual and textual networks project image features and text features into their corresponding hidden spaces, and the interaction network then aligns the resulting image-text representations in a common space. To unify both semantics and rankings, we propose a new optimization paradigm that decouples semantic alignment from ranking learning: a pre-alignment stage for semantic knowledge transfer, optimized by semantic classification, and a ranking-learning stage for the final retrieval. For ranking learning, we introduce a cross-AP loss that directly optimizes the retrieval metric average precision (AP) for cross-modal retrieval. We conduct experiments on four widely used benchmarks: the Wikipedia, Pascal Sentence, NUS-WIDE-10k, and PKU XMediaNet datasets. Extensive experimental results show that the proposed method achieves higher retrieval precision.
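To illustrate how average precision can be optimized directly, the following is a minimal PyTorch sketch of a differentiable AP surrogate for image-to-text retrieval. It is not the paper's cross-AP loss, only a common sigmoid-relaxed rank approximation; the function name `cross_ap_loss`, the temperature `tau`, and the binary relevance matrix `labels` are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def cross_ap_loss(img_emb, txt_emb, labels, tau=0.01):
    """Sketch of a differentiable AP surrogate for cross-modal retrieval.

    img_emb: (N, d) image embeddings used as queries
    txt_emb: (M, d) text embeddings used as the gallery
    labels:  (N, M) binary relevance matrix (1 = semantically matched pair)
    tau:     temperature of the sigmoid that relaxes the rank indicator
    """
    # Cosine similarity between every image query and every text item.
    sim = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).T        # (N, M)

    # diff[q, i, j] = sim[q, j] - sim[q, i]; the sigmoid softly indicates
    # "text j is ranked above text i for query q".
    diff = sim.unsqueeze(1) - sim.unsqueeze(2)                               # (N, M, M)
    above = torch.sigmoid(diff / tau)
    above = above * (1.0 - torch.eye(sim.size(1), device=sim.device))        # drop j == i

    pos = labels.float()                                                     # (N, M)
    rank_all = 1.0 + above.sum(dim=2)                                        # soft rank among all texts
    rank_pos = 1.0 + (above * pos.unsqueeze(1)).sum(dim=2)                   # soft rank among positives

    # Approximate AP per image query, averaged over its positive texts,
    # then turn it into a loss to minimize.
    ap = (pos * rank_pos / rank_all).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return 1.0 - ap.mean()
```

The same function can be applied with the arguments swapped (text queries against an image gallery) and the two directions averaged, mirroring the bidirectional evaluation protocol typical of cross-modal retrieval.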