Abstract

This paper considers the problem of cross-modal retrieval, e.g., using a text query to search for images and vice versa. Existing approaches usually learn a common subspace in which the shared parts of different modalities can be compared directly. However, no previous work explicitly shows that the learned space contains only the common information and excludes the modality-specific information, and separating these two types of information would benefit cross-modal retrieval. In this paper, we present a COordinated and Specific autoEncoder (COSE) that distinguishes the common part of different modalities from the modality-specific parts. COSE consists of two subnetworks, each with two representation layers: the common representation layer learns the patterns shared across modalities, while the modality-specific representation layer learns the patterns owned by each individual modality. We evaluate our model on the task of cross-modal retrieval over three publicly available real-world datasets, and extensive experiments demonstrate the effectiveness of COSE.
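The abstract gives only a high-level description of the architecture (two subnetworks, each with a common and a modality-specific representation layer). The following is a minimal sketch, not the paper's actual implementation: the class name ModalityBranch, the layer dimensions, and the reconstruction/alignment losses are illustrative assumptions chosen to show how the two representation layers per branch could be arranged.

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """One subnetwork: encodes one modality into a common code and a
    modality-specific code, then reconstructs the input from both."""
    def __init__(self, input_dim, common_dim=128, specific_dim=128, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Two representation layers per branch, as described in the abstract.
        self.common_layer = nn.Linear(hidden_dim, common_dim)      # shared patterns
        self.specific_layer = nn.Linear(hidden_dim, specific_dim)  # modality-specific patterns
        self.decoder = nn.Sequential(
            nn.Linear(common_dim + specific_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        c, s = self.common_layer(h), self.specific_layer(h)
        x_hat = self.decoder(torch.cat([c, s], dim=-1))
        return c, s, x_hat

# Hypothetical usage: image features (e.g., 4096-d CNN) and text features (e.g., 300-d).
image_branch = ModalityBranch(input_dim=4096)
text_branch = ModalityBranch(input_dim=300)

img, txt = torch.randn(8, 4096), torch.randn(8, 300)
c_img, s_img, img_rec = image_branch(img)
c_txt, s_txt, txt_rec = text_branch(txt)

# Illustrative objective (assumed, not taken from the paper): reconstruction per
# modality, plus alignment of the common codes of paired samples so that they
# can be compared directly for cross-modal retrieval.
recon_loss = nn.functional.mse_loss(img_rec, img) + nn.functional.mse_loss(txt_rec, txt)
align_loss = nn.functional.mse_loss(c_img, c_txt)
loss = recon_loss + align_loss
```

At retrieval time, only the common codes would be compared across modalities; the specific codes serve to absorb information that should not enter the shared space. The actual constraints COSE uses to enforce this separation are described in the paper itself, not in the abstract.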
