Abstract

With the growth of multimedia data, the problem of cross-media (or cross-modal) retrieval has attracted considerable interest in the multimedia community. One solution is to learn a common representation for multimedia data. In this paper, we propose a simple but effective deep learning method to address the cross-media retrieval problem between images and text documents, for samples with either single or multiple labels. Specifically, two independent deep networks are learned to project the input feature vectors of images and text into a common (isomorphic) semantic space with high-level abstraction (semantics). With feature representations of the same dimensionality in the learned common semantic space, the similarity between images and text documents can be measured directly. The correlation between the two modalities is established through their shared ground-truth probability vector. To better bridge the gap between images and the corresponding semantic concepts, an open-source CNN implementation, Deep Convolutional Activation Feature (DeCAF), is employed to extract the input visual features for the proposed deep network. Extensive experiments on two publicly available multi-label datasets, NUS-WIDE and PASCAL VOC 2007, show that the proposed method achieves better cross-media retrieval results than other state-of-the-art methods.
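
As a rough, hedged sketch of the two-branch design described above (not the authors' exact implementation), the code below projects precomputed DeCAF image features and text features into a shared semantic space whose dimensionality equals the number of concept labels, trains each branch against the shared ground-truth probability vector, and ranks candidates by cosine similarity. All layer sizes, variable names, and the choice of a KL-divergence loss are assumptions made for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ProjectionNet(nn.Module):
        """One branch: maps modality-specific features to the shared semantic space."""
        def __init__(self, in_dim, hidden_dim, num_classes):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),  # output aligned with the label-probability vector
            )

        def forward(self, x):
            return self.net(x)

    # Hypothetical dimensions: 4096-d DeCAF image features, 2000-d text features,
    # shared space sized to the number of semantic concepts (e.g., 81 for NUS-WIDE).
    image_net = ProjectionNet(in_dim=4096, hidden_dim=1024, num_classes=81)
    text_net = ProjectionNet(in_dim=2000, hidden_dim=1024, num_classes=81)

    def train_step(img_feats, txt_feats, label_probs, optimizer):
        """Each branch regresses the shared ground-truth probability vector,
        which is what ties the two modalities together in this sketch."""
        img_log_probs = F.log_softmax(image_net(img_feats), dim=1)
        txt_log_probs = F.log_softmax(text_net(txt_feats), dim=1)
        # KL divergence to the shared label distribution, summed over both branches.
        loss = (F.kl_div(img_log_probs, label_probs, reduction="batchmean")
                + F.kl_div(txt_log_probs, label_probs, reduction="batchmean"))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    def retrieve_text(query_img_feat, gallery_txt_feats, top_k=5):
        """Image-to-text retrieval: cosine similarity in the shared semantic space."""
        with torch.no_grad():
            q = F.normalize(image_net(query_img_feat), dim=1)    # (1, C)
            g = F.normalize(text_net(gallery_txt_feats), dim=1)  # (N, C)
            scores = q @ g.t()                                   # (1, N)
        return scores.topk(top_k, dim=1).indices

A single optimizer over both branches, e.g. torch.optim.Adam(list(image_net.parameters()) + list(text_net.parameters())), would be passed to train_step; text-to-image retrieval follows symmetrically by swapping the roles of the two branches.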
