Abstract

Because of the rapid growth of multimodal data on the internet and social media, cross-modal retrieval has become an important and valuable task in recent years. The purpose of cross-modal retrieval is to obtain result data in one modality (e.g., image) that are semantically similar to the query data in another modality (e.g., text). In the field of remote sensing, despite a great number of existing works on image retrieval, there has been only a small amount of research on cross-modal image-text retrieval, owing to the scarcity of datasets and the complicated characteristics of remote sensing image data. In this article, we introduce a novel cross-modal image-text retrieval network to establish a direct relationship between remote sensing images and their paired text data. Specifically, we design a semantic alignment module that fully explores the latent correspondence between images and text, using attention and gate mechanisms to filter and optimize data features so that more discriminative feature representations can be obtained. Experimental results on four benchmark remote sensing datasets, including UCMerced-LandUse-Captions, Sydney-Captions, RSICD, and NWPU-RESISC45-Captions, show that our proposed method outperforms other baselines and achieves state-of-the-art performance on remote sensing image-text retrieval tasks.
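The abstract does not spell out the semantic alignment module, but the attention-and-gate filtering it describes can be illustrated with a minimal PyTorch sketch. Everything below (module name, dimensions, and the exact gated-fusion rule) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of an attention-plus-gate feature filter, loosely
# following the abstract's description of the semantic alignment module.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class AttentionGateFilter(nn.Module):
    """Cross-attend query features to context features, then gate the result."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.scale = dim ** -0.5
        # The gate decides, per channel, how much attended context to keep.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (batch, n_q, dim)  e.g., image region features
        # context: (batch, n_c, dim)  e.g., word features of the caption
        attn = torch.softmax(query @ context.transpose(1, 2) * self.scale, dim=-1)
        attended = attn @ context                      # (batch, n_q, dim)
        g = torch.sigmoid(self.gate(torch.cat([query, attended], dim=-1)))
        return g * attended + (1 - g) * query          # gated fusion


# Toy usage: 36 image regions attending to 20 caption tokens.
img = torch.randn(2, 36, 512)
txt = torch.randn(2, 20, 512)
out = AttentionGateFilter(512)(img, txt)
print(out.shape)  # torch.Size([2, 36, 512])
```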

Highlights

  • With the rapid development of Earth observation technology, the quantity and quality of remote sensing data have increased rapidly.

  • We show the whole structure of our proposed deep image–text semantic alignment network in Fig. 2, which mainly includes the following three parts: 1) extraction of remote sensing image features; 2) extraction of text features; and 3) a semantic alignment module (SAM).

  • To demonstrate the effectiveness of our proposed method, we evaluate our SAM on four public datasets: UCMerced-LandUse-Captions, Sydney-Captions, RSICD, and NWPU-RESISC45-Captions.
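For context on how such retrieval experiments are typically scored, below is a minimal sketch of the standard Recall@K metric commonly reported on these benchmarks. The random similarity matrix and the one-match-per-query ground truth are illustrative assumptions; the paper's exact protocol (e.g., multiple captions per image) may differ.

```python
# Minimal sketch of the standard Recall@K metric for image-text retrieval.
# The similarity matrix here is random stand-in data; ground truth i == j
# (one match per query) is an assumption, not the paper's exact protocol.
import numpy as np


def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] = similarity of query i to candidate j; ground truth is i == j."""
    # Rank candidates for each query from most to least similar.
    ranks = np.argsort(-sim, axis=1)
    # A hit: the matching candidate appears among the top-k results.
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())


sim = np.random.randn(100, 100)  # e.g., 100 text queries vs. 100 images
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(sim, k):.3f}")
```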

Introduction

With the rapid development of Earth observation technology, the quantity and quality of remote sensing data have increased rapidly, spurring a large body of research on the remote sensing image retrieval task [1]–[6]. Rather than retrieving within unimodal data, people are increasingly inclined to search for the required information in multimodal data with richer semantics. Cross-modal retrieval technology can mine effective information and has broad application prospects in many fields, such as disaster early warning and resource management. While satisfactory accuracy has been achieved in the cross-modal retrieval of natural images [7]–[9], it remains difficult to implement effective and efficient cross-modal retrieval of remote sensing images, since these images have complicated characteristics such as multiple scales, small targets, high resolution, and a lack of annotated information.
