Abstract

Cross-modal retrieval is an intelligent understanding task over data from different modalities, and its central challenge is measuring the similarity between cross-modal data. Existing methods mainly learn a common space through feature-wise or label-based supervised learning. However, feature-wise methods focus only on the interactions between pairs of cross-modal data, while label-based supervised learning relies excessively on classification accuracy. Within the same space, these methods cannot capture the more comprehensive interactions between cross-modal data; that is, given a query, the query and the retrieved data have a one-to-many correspondence, and the similarity of the true pair-wise match should be the largest. Therefore, a unified perspective of multi-level cross-modal similarity (MCMS) is proposed for cross-modal retrieval. The core ideas of MCMS are as follows: 1) the local similarity between cross-modal data is integrated to enrich fine-grained cross-modal information; 2) the similarity between the common feature vector and the label is designed to obtain one-to-many correspondences between cross-modal data. In addition, Normalized Discounted Cumulative Gain (NDCG) is used for the first time as an evaluation metric to comprehensively evaluate cross-modal retrieval results. Extensive experiments demonstrate that MCMS achieves better performance on cross-modal retrieval tasks.
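As a brief illustration of the evaluation metric mentioned above, the following minimal sketch computes NDCG for a ranked retrieval list under the assumption of binary relevance (a retrieved item is relevant if it shares the query's label). The function names and the relevance convention are illustrative, not the paper's implementation.

import numpy as np

def dcg(relevance):
    # Discounted cumulative gain: relevance discounted by log2 of rank position.
    relevance = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, relevance.size + 2))
    return float(np.sum(relevance / discounts))

def ndcg(relevance):
    # NDCG = DCG of the actual ranking / DCG of the ideal (sorted) ranking.
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

# Example: binary relevance of the top-5 retrieved items for one query.
print(ndcg([1, 0, 1, 1, 0]))  # approximately 0.91

Unlike mAP, which treats relevance as binary only, NDCG also accommodates graded relevance, which is why it can evaluate retrieval results more comprehensively.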
