Abstract

Remote sensing (RS) images are widely used in civilian and military fields. With the rapid growth of image data, fast and efficient RS image retrieval has become a challenging issue. However, existing image retrieval methods, whether text-based or content-based, remain limited in practice; for example, text input is inefficient, and a sample image for the query is often unavailable. Speech, by contrast, is a natural and convenient means of communication. Therefore, a novel speech-image cross-modal retrieval approach, named deep visual-audio network (DVAN), is presented in this article, which establishes a direct relationship between image and speech from paired image-audio data. The model has three main parts: 1) image feature extraction, which extracts effective features of RS images; 2) audio feature learning, which recognizes key information from raw audio data, where AudioNet, proposed as part of DVAN, is used to obtain more discriminative features; and 3) multimodal embedding, which learns the direct correlations between the two modalities. Experimental results on an RS image-audio dataset demonstrate that the proposed method is effective and that speech-image retrieval is feasible, providing a new way toward faster and more convenient RS image retrieval.
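The retrieval pattern the abstract describes, encoding both modalities into a shared space and ranking images by similarity to a spoken query, can be sketched as follows. This is a minimal illustration only: the linear projections `W_img` and `W_aud` stand in for the paper's image branch and AudioNet, and all feature vectors here are random placeholders, not the actual DVAN method or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholder encoders: in DVAN these would be the image
# feature extractor and AudioNet; here, simple linear projections map
# each modality's features into a shared 64-d embedding space.
D_IMG, D_AUD, D_EMB = 512, 128, 64
W_img = rng.standard_normal((D_IMG, D_EMB))
W_aud = rng.standard_normal((D_AUD, D_EMB))

def embed(x, W):
    """Project features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy gallery of 5 RS image feature vectors and one spoken query.
images = rng.standard_normal((5, D_IMG))
query_audio = rng.standard_normal(D_AUD)

img_emb = embed(images, W_img)       # shape (5, 64)
aud_emb = embed(query_audio, W_aud)  # shape (64,)

# Cosine similarity is the dot product of unit vectors; rank the
# gallery images by similarity to the spoken query, best first.
scores = img_emb @ aud_emb
ranking = np.argsort(-scores)
print(ranking)
```

In the actual model, the two encoders would be trained jointly on paired image-audio data so that matching pairs score higher than mismatched ones; the ranking step itself is the same.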
