Abstract
With the rapid growth of multimodal web data, cross-modal retrieval, i.e., using a text query to search for images or vice versa, has attracted considerable attention from researchers. Existing approaches usually learn a common representation space in which different modalities can be directly compared. However, little work has been done to verify that the learned common representation space contains only the part shared between modalities. In this paper, we present a coordinated and specific restricted Boltzmann machine (CSRBM) that separates the common part from the modality-specific part of each modality. The proposed CSRBM consists of two RBMs, each with two hidden layers: the common hidden layer learns the patterns shared across modalities, while the modality-specific hidden layer learns the patterns owned by an individual modality. To verify the effectiveness of this split, we construct a multimodal dataset based on the popular MNIST dataset. Moreover, we evaluate our model on three publicly available real-world datasets on the task of cross-modal retrieval. Extensive experiments demonstrate the effectiveness of the proposed CSRBM.
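To make the architecture described above concrete, the following is a minimal sketch of one way such a split hidden layer could be organised: one RBM per modality, each with a "common" hidden group (whose activations would be coordinated across modalities and used for retrieval) and a "specific" hidden group kept private to that modality. All class names, dimensions, and the similarity score are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ModalityRBM:
    """One modality's RBM with a split hidden layer:
    a 'common' group (to be coordinated with the other modality)
    and a 'specific' group private to this modality.
    Names and sizes are hypothetical, for illustration only."""
    def __init__(self, n_visible, n_common, n_specific, seed=0):
        rng = np.random.default_rng(seed)
        self.W_common = 0.01 * rng.standard_normal((n_visible, n_common))
        self.W_specific = 0.01 * rng.standard_normal((n_visible, n_specific))
        self.b_common = np.zeros(n_common)
        self.b_specific = np.zeros(n_specific)

    def hidden_probs(self, v):
        """Up-pass: activation probabilities of both hidden groups given visibles."""
        h_common = sigmoid(v @ self.W_common + self.b_common)
        h_specific = sigmoid(v @ self.W_specific + self.b_specific)
        return h_common, h_specific

# Two RBMs, one per modality; only the common hidden activations
# are compared for cross-modal retrieval.
image_rbm = ModalityRBM(n_visible=784, n_common=64, n_specific=32, seed=0)
text_rbm = ModalityRBM(n_visible=2000, n_common=64, n_specific=32, seed=1)

v_img = np.random.default_rng(2).random(784)
v_txt = np.random.default_rng(3).random(2000)
img_common, _ = image_rbm.hidden_probs(v_img)
txt_common, _ = text_rbm.hidden_probs(v_txt)

# Retrieval score: cosine similarity in the shared (common) space only.
score = (img_common @ txt_common) / (
    np.linalg.norm(img_common) * np.linalg.norm(txt_common)
)
print(f"cross-modal similarity: {score:.4f}")
```

In such a setup, training would additionally need a coordination term that encourages the two common hidden layers to agree on paired data, which is the role the paper assigns to the "coordinated" component of the CSRBM.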