Abstract

Massive amounts of images and texts are emerging on the Internet, creating demand for effective cross-modal retrieval. To eliminate the heterogeneity between the image and text modalities, existing subspace learning methods learn a common latent subspace in which cross-modal matching can be performed. However, these methods usually require fully paired samples (images with corresponding texts) and ignore the class label information that accompanies the paired samples. In fact, class label information can reduce the semantic gap between different modalities and explicitly guide the subspace learning procedure. In addition, the large quantities of unpaired samples (images or texts) may provide useful side information to enrich the representations learned in the subspace. In this work we therefore propose a novel model for the cross-modal retrieval problem. It consists of (1) a semi-supervised coupled dictionary learning step that generates homogeneous sparse representations for the different modalities from both paired and unpaired samples, and (2) a coupled feature mapping step that projects the sparse representations of the different modalities into a common subspace defined by class label information, in which cross-modal matching is performed. We conducted extensive experiments on three benchmark datasets under a fully paired setting and on a large-scale real-world web dataset under a partially paired setting. The results demonstrate the effectiveness of the proposed method on cross-modal retrieval tasks.
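To make the two-step pipeline concrete, the following is a minimal Python sketch, not the paper's actual algorithm: the data, dimensions, and names (dict_img, proj_img, etc.) are hypothetical, each modality's dictionary is learned independently with scikit-learn's DictionaryLearning rather than coupled across modalities or fed unpaired samples, and ridge regression onto one-hot labels stands in for the paper's coupled feature mapping into the label-defined subspace.

import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical paired training data: image features, text features, one-hot class labels.
rng = np.random.default_rng(0)
n_pairs, d_img, d_txt, n_classes, n_atoms = 200, 64, 48, 10, 32
X_img = rng.standard_normal((n_pairs, d_img))
X_txt = rng.standard_normal((n_pairs, d_txt))
Y = np.eye(n_classes)[rng.integers(0, n_classes, n_pairs)]  # class labels as one-hot vectors

# Step 1 (simplified): learn a dictionary per modality and encode samples sparsely.
# The paper couples the two dictionaries and also exploits unpaired samples; here each
# dictionary is learned independently, for illustration only.
dict_img = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=20, random_state=0)
dict_txt = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=20, random_state=0)
S_img = dict_img.fit_transform(X_img)  # sparse codes for images
S_txt = dict_txt.fit_transform(X_txt)  # sparse codes for texts

# Step 2 (simplified): linearly map each modality's sparse codes into the label space,
# so that semantically matched samples land near each other in the common subspace.
proj_img = Ridge(alpha=0.1).fit(S_img, Y)
proj_txt = Ridge(alpha=0.1).fit(S_txt, Y)

# Cross-modal retrieval: given a query image, rank all texts by cosine similarity
# between their projections in the common (label-defined) subspace.
query_common = proj_img.predict(dict_img.transform(X_img[:1]))
gallery_common = proj_txt.predict(S_txt)
ranking = np.argsort(-cosine_similarity(query_common, gallery_common)[0])
print("top-5 retrieved text indices:", ranking[:5])

Under these assumptions, retrieval reduces to ranking the other modality's projected sparse codes by similarity in the shared label-defined space, which is the matching step the abstract describes.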
