Abstract

We investigate the effectiveness of a successful Visual Question Answering (VQA) model as the core component of a cross-modal retrieval system that accepts images or text as queries in order to retrieve relevant data from a multimodal document collection. To this end, we adapt the VQA model for deep multimodal learning, combining visual and textual representations for information search; we call this model “Deep Multimodal Embeddings (DME)”. Instead of training the model to answer questions, we supervise DME to classify semantic topics/concepts previously identified in the document collection of interest. In contrast to previous approaches, we find that this model can handle any multimodal query with a single architecture while producing improved or competitive results in all retrieval tasks. We evaluate the model on three widely known cross-modal retrieval databases: Wikipedia Retrieval Database, Pascal Sentences, and MIR-Flickr-25k. The results show that DME learns effective multimodal representations, yielding strongly improved retrieval performance, especially in the top results of the ranked list, which matter most to users in the most common information retrieval scenarios. Our work provides a new baseline for a wide range of cross-modal retrieval methodologies.
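To make the idea concrete, the following is a minimal sketch of the DME scheme described above: project image and text features into a shared embedding space, supervise that embedding with a topic/concept classifier rather than a VQA answer head, and rank documents by similarity in the embedding space at query time. The encoders, fusion operation (element-wise product), dimensions, and all names here are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of the DME idea (assumed architecture, not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMESketch(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=512, n_topics=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)      # projects precomputed image features
        self.txt_proj = nn.Linear(txt_dim, emb_dim)      # projects precomputed text features
        self.topic_head = nn.Linear(emb_dim, n_topics)   # topic/concept classifier used for supervision

    def embed(self, img_feat=None, txt_feat=None):
        """Joint embedding; either modality may be missing at query time."""
        parts = []
        if img_feat is not None:
            parts.append(torch.tanh(self.img_proj(img_feat)))
        if txt_feat is not None:
            parts.append(torch.tanh(self.txt_proj(txt_feat)))
        # Multiplicative fusion when both modalities are present (a common VQA-style choice).
        emb = parts[0] if len(parts) == 1 else parts[0] * parts[1]
        return F.normalize(emb, dim=-1)

    def forward(self, img_feat=None, txt_feat=None):
        return self.topic_head(self.embed(img_feat, txt_feat))

# Training minimizes cross-entropy on topic labels; retrieval ranks stored document
# embeddings by cosine similarity with the query embedding.
model = DMESketch()
img = torch.randn(4, 2048)                 # toy image features
txt = torch.randn(4, 300)                  # toy text features
topics = torch.randint(0, 10, (4,))        # toy topic labels
loss = F.cross_entropy(model(img, txt), topics)

query_emb = model.embed(txt_feat=txt[:1])  # text-only query
doc_embs = model.embed(img_feat=img)       # image-only documents
scores = query_emb @ doc_embs.t()          # cosine similarity (embeddings are L2-normalized)
ranking = scores.argsort(descending=True)
```

Because a single network produces embeddings for image-only, text-only, or combined inputs, the same architecture can serve every cross-modal query type, which is the property the abstract highlights.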
