Abstract

Massive amounts of images and texts are emerging on the Internet, creating demand for effective cross-modal retrieval. To eliminate the heterogeneity between the image and text modalities, existing subspace learning methods learn a common latent subspace in which cross-modal matching can be performed. However, these methods usually require fully paired samples (images with corresponding texts) and ignore the class label information that accompanies the paired samples. In fact, class label information can reduce the semantic gap between different modalities and explicitly guide the subspace learning procedure. In addition, the large quantities of unpaired samples (images or texts) may provide useful side information to enrich the representations learned in the subspace. In this work we therefore propose a novel model for the cross-modal retrieval problem. It consists of (1) a semi-supervised coupled dictionary learning step that generates homogeneous sparse representations for the different modalities from both paired and unpaired samples, and (2) a coupled feature mapping step that projects the sparse representations of the different modalities into a common subspace defined by class label information, in which cross-modal matching is performed. We conducted extensive experiments on three benchmark datasets under a fully paired setting and on a large-scale real-world web dataset under a partially paired setting. The results demonstrate the effectiveness of the proposed method on cross-modal retrieval tasks.
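To make the two-step pipeline concrete, the following is a minimal Python sketch, not the paper's actual algorithm: the data, dimensions, and names (dict_img, proj_img, etc.) are hypothetical, each modality's dictionary is learned independently with scikit-learn's DictionaryLearning rather than coupled across modalities or fed unpaired samples, and ridge regression onto one-hot labels stands in for the paper's coupled feature mapping into the label-defined subspace.

import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical paired training data: image features, text features, one-hot class labels.
rng = np.random.default_rng(0)
n_pairs, d_img, d_txt, n_classes, n_atoms = 200, 64, 48, 10, 32
X_img = rng.standard_normal((n_pairs, d_img))
X_txt = rng.standard_normal((n_pairs, d_txt))
Y = np.eye(n_classes)[rng.integers(0, n_classes, n_pairs)]  # class labels as one-hot vectors

# Step 1 (simplified): learn a dictionary per modality and encode samples sparsely.
# The paper couples the two dictionaries and also exploits unpaired samples; here each
# dictionary is learned independently, for illustration only.
dict_img = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=20, random_state=0)
dict_txt = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=20, random_state=0)
S_img = dict_img.fit_transform(X_img)  # sparse codes for images
S_txt = dict_txt.fit_transform(X_txt)  # sparse codes for texts

# Step 2 (simplified): linearly map each modality's sparse codes into the label space,
# so that semantically matched samples land near each other in the common subspace.
proj_img = Ridge(alpha=0.1).fit(S_img, Y)
proj_txt = Ridge(alpha=0.1).fit(S_txt, Y)

# Cross-modal retrieval: given a query image, rank all texts by cosine similarity
# between their projections in the common (label-defined) subspace.
query_common = proj_img.predict(dict_img.transform(X_img[:1]))
gallery_common = proj_txt.predict(S_txt)
ranking = np.argsort(-cosine_similarity(query_common, gallery_common)[0])
print("top-5 retrieved text indices:", ranking[:5])

Under these assumptions, retrieval reduces to ranking the other modality's projected sparse codes by similarity in the shared label-defined space, which is the matching step the abstract describes.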
