Abstract

This paper investigates the problem of modeling Internet images and their associated text for cross-modal retrieval tasks such as text-to-image and image-to-text search. Canonical correlation analysis (CCA), a classic two-view approach for mapping text and images into a common latent space, does not exploit the semantic information of text-image pairs. We extend CCA to map text, image, and semantic information into a common latent space in which the correlation of the three views is maximized. To further improve performance, we propose 3view-Deep Canonical Correlation Analysis (3view-DCCA), a nonlinear extension of CCA that learns the complex nonlinear transformations between the three views. Like most deep learning methods, DCCA is prone to overfitting. To counter this, we add the reconstruction loss of each view to the loss function, which also includes the correlation loss between every pair of views and a regularization term on the parameters. Inspired by PageRank, we propose a search-based similarity method to score relevance. The proposed model (3view-DCCA) is evaluated on three publicly available real-world data sets. We demonstrate that our deep model performs significantly better than traditional CCA-based models and several other deep learning models on cross-modal retrieval tasks.
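
To make the composition of the objective concrete, the sketch below shows one way the three-view loss could be assembled in PyTorch: a CCA correlation term for each of the three view pairs, a per-view reconstruction term, and an L2 penalty on the parameters. The encoder/decoder interfaces, the variable names, and the weights `lam` and `mu` are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def cca_corr_loss(H1: torch.Tensor, H2: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Negative total canonical correlation between two batches of
    latent representations, each of shape (n_samples, dim)."""
    n = H1.size(0)
    H1 = H1 - H1.mean(dim=0, keepdim=True)  # center each view
    H2 = H2 - H2.mean(dim=0, keepdim=True)
    # Regularized covariance estimates.
    S11 = (H1.T @ H1) / (n - 1) + eps * torch.eye(H1.size(1))
    S22 = (H2.T @ H2) / (n - 1) + eps * torch.eye(H2.size(1))
    S12 = (H1.T @ H2) / (n - 1)
    # Whiten the cross-covariance: the singular values of
    # T = S11^{-1/2} S12 S22^{-1/2} are the canonical correlations.
    L1 = torch.linalg.cholesky(S11)
    L2 = torch.linalg.cholesky(S22)
    T = torch.linalg.solve_triangular(L1, S12, upper=False)    # L1^{-1} S12
    T = torch.linalg.solve_triangular(L2, T.T, upper=False).T  # ... L2^{-T}
    # Maximizing correlation = minimizing its negative.
    return -torch.linalg.svdvals(T).sum()


def three_view_loss(hx, hy, hz, x, y, z, dec_x, dec_y, dec_z,
                    params, lam=1.0, mu=1e-4):
    """Correlation loss over every pair of views, plus per-view
    reconstruction losses and an L2 penalty on the parameters.
    hx, hy, hz are the encoded views; dec_* are the (assumed) decoders."""
    corr = (cca_corr_loss(hx, hy) + cca_corr_loss(hx, hz)
            + cca_corr_loss(hy, hz))
    mse = torch.nn.functional.mse_loss
    recon = mse(dec_x(hx), x) + mse(dec_y(hy), y) + mse(dec_z(hz), z)
    reg = sum(p.pow(2).sum() for p in params)
    return corr + lam * recon + mu * reg
```

Cholesky whitening is used here only as a numerically convenient way to obtain the canonical correlations; any whitening transform with the same covariance-normalizing property would yield the same singular values.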
