Abstract

With the rapid growth of multimodal web data, the task of cross-modal retrieval, i.e., using a text query to search for images or vice versa, has attracted considerable attention from researchers. Existing approaches usually learn a common representation space in which different modalities can be directly compared. However, little work has been done to verify that the learned common representation space contains only the part shared between different modalities. In this paper, we present a coordinated and specific restricted Boltzmann machine (CSRBM) that can distinguish the common part from the modality-specific part of different modalities. The proposed CSRBM consists of two RBMs, each with two hidden layers: the common hidden layer learns the patterns shared across modalities, while the modality-specific hidden layer learns the patterns owned by each individual modality. To verify the splitting effectiveness of the proposed model, we construct a multimodal dataset based on the popular MNIST dataset. Moreover, we evaluate our model on three publicly available real-world datasets on the task of cross-modal retrieval. Extensive experiments demonstrate the effectiveness of our CSRBM.
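To make the described architecture concrete, below is a minimal sketch of the layer split the abstract outlines: two RBMs, one per modality, each connecting its visible units to a common hidden layer and a modality-specific hidden layer, with cross-modal comparison performed only in the common space. All layer sizes, the Bernoulli-style sigmoid units, the single mean-field pass, and the cosine-similarity comparison are illustrative assumptions, not the paper's actual formulation or training procedure.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ModalityRBM:
    """One RBM of the sketch: a visible layer connected to two hidden layers.

    The 'common' hidden layer is meant to capture patterns shared across
    modalities; the 'specific' hidden layer captures patterns unique to this
    modality. Sizes and unit types here are assumptions for illustration only.
    """

    def __init__(self, n_visible, n_common, n_specific, seed=0):
        rng = np.random.default_rng(seed)
        self.W_common = 0.01 * rng.standard_normal((n_visible, n_common))
        self.W_specific = 0.01 * rng.standard_normal((n_visible, n_specific))
        self.b_common = np.zeros(n_common)
        self.b_specific = np.zeros(n_specific)

    def hidden_activations(self, v):
        """Split an input vector into common and modality-specific codes."""
        h_common = sigmoid(v @ self.W_common + self.b_common)
        h_specific = sigmoid(v @ self.W_specific + self.b_specific)
        return h_common, h_specific


# Two RBMs, one per modality; retrieval compares only the common codes.
image_rbm = ModalityRBM(n_visible=4096, n_common=128, n_specific=64)
text_rbm = ModalityRBM(n_visible=2000, n_common=128, n_specific=64)

v_img = np.random.rand(4096)  # toy image feature vector
v_txt = np.random.rand(2000)  # toy text feature vector (e.g., bag-of-words)

h_img_common, _ = image_rbm.hidden_activations(v_img)
h_txt_common, _ = text_rbm.hidden_activations(v_txt)

# Cosine similarity in the common representation space.
similarity = h_img_common @ h_txt_common / (
    np.linalg.norm(h_img_common) * np.linalg.norm(h_txt_common)
)
print(f"cross-modal similarity (common space): {similarity:.3f}")
```

In this sketch, coordination between the two RBMs (i.e., encouraging their common hidden layers to agree on paired data) and the actual RBM training objective are omitted; only the common/specific split of the hidden representation is shown.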
