Abstract

In this paper, a novel unsupervised Cross-Modal retrieval method via Sufficient Scene Representation (CMSSR) is proposed. Unlike existing methods, which mainly focus on simultaneously preserving the mutually constrained intra- and inter-modal similarity relations, CMSSR regards data of different modalities as descriptions of the same scene from different views and accordingly integrates the information of different modalities to learn a complete common representation containing sufficient information about the corresponding scene. To obtain such a common representation, a Gaussian Mixture Model (GMM) is first utilized to generate a statistical representation of each uni-modal datum, so that each uni-modal space is abstracted as a uni-modal statistical manifold. The common space is then assumed to be a higher-dimensional statistical manifold with the uni-modal statistical manifolds as its sub-manifolds. To generate a sufficient scene representation from uni-modal data, a representation completion strategy based on logistic regression is proposed to effectively complete the missing representation of the other modality. The similarity between multi-modal data can then be more accurately reflected by the distance metric on the common statistical manifold. Based on this metric, Iterative Quantization is utilized to generate binary codes for fast cross-modal retrieval. Extensive experiments on three standard benchmark datasets demonstrate the superiority of CMSSR over several state-of-the-art methods.
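To make the pipeline concrete, the following is a minimal sketch of the three stages the abstract names (GMM-based statistical representation, logistic-regression completion, ITQ binarization), assuming NumPy and scikit-learn. All function names, hyper-parameters, and the concatenated common space are illustrative assumptions, not the authors' implementation; in particular, concatenating posteriors is only a crude stand-in for the paper's distance metric on the common statistical manifold.

    # Minimal sketch (not the authors' code): GMM posterior representations,
    # logistic-regression completion, and ITQ binarization. Hyper-parameters
    # and the concatenated common space are illustrative assumptions.
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.linear_model import LogisticRegression

    def gmm_representation(X, n_components=8, seed=0):
        """Represent each sample by its posterior over GMM components,
        a point on the uni-modal statistical manifold (illustrative)."""
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed).fit(X)
        return gmm.predict_proba(X)          # shape (n_samples, n_components)

    def itq(V, n_bits=8, n_iter=50, seed=0):
        """Iterative Quantization: PCA-project, then alternate between
        quantizing and solving an orthogonal Procrustes problem for R."""
        V = V - V.mean(axis=0)
        _, _, Vt = np.linalg.svd(V, full_matrices=False)
        Vp = V @ Vt[:n_bits].T               # PCA projection to n_bits dims
        rng = np.random.default_rng(seed)
        R, _ = np.linalg.qr(rng.normal(size=(n_bits, n_bits)))
        for _ in range(n_iter):
            B = np.sign(Vp @ R)              # fix R, quantize
            U, _, Wt = np.linalg.svd(B.T @ Vp)
            R = Wt.T @ U.T                   # fix B, update the rotation
        return (Vp @ R) > 0                  # binary codes

    # Toy paired data standing in for image/text features of the same scenes.
    rng = np.random.default_rng(0)
    X_img, X_txt = rng.normal(size=(500, 64)), rng.normal(size=(500, 32))
    R_img, R_txt = gmm_representation(X_img), gmm_representation(X_txt)

    # Completion: predict the dominant text-side component from the image-side
    # statistics, giving a completed text representation for image-only queries.
    clf = LogisticRegression(max_iter=1000).fit(R_img, R_txt.argmax(axis=1))
    R_txt_hat = clf.predict_proba(R_img)

    # Common representation and binary codes for fast retrieval.
    Z = np.hstack([R_img, R_txt_hat])
    codes = itq(Z, n_bits=8)
    print(codes.shape)                       # (500, 8)

Hamming distances between such codes then serve as a fast proxy for the manifold distance when ranking items of one modality against queries from the other.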
