Abstract

Remote sensing cross-modal retrieval (RSCR) is increasingly important because it enables fast and flexible retrieval of valuable data from enormous collections of remote sensing (RS) images. However, traditional RSCR methods tend to focus on retrieval between two modalities; as the number of modalities increases, the contradiction between the widening semantic gap and the small amount of paired data prevents the model from learning a superior modal representation. In this paper, inspired by the visual-based modal center in RS, we construct a multi-source cross-modal retrieval network (MCRN) that unifies RS retrieval tasks under multiple retrieval sources. To address the data heterogeneity caused by multiple data sources, we propose a shared pattern transfer module (SPTM) based on pattern memory, combined with generative adversarial learning, to obtain semantic representations unbound from modality. Meanwhile, to cope with the scarcity of annotated data in RS scenarios, multiple unimodal self-supervised frameworks are unified to obtain robust pre-training parameters for the designed MCRN by combining domain alignment and contrastive learning. Finally, we propose a multi-source triplet loss, a unimodal contrastive loss, and a semantic consistency loss, which enable MCRN to achieve competitive results through multitask learning for semantic alignment. We construct the multimodal datasets M-RSICD and M-RSITMD, conduct extensive experiments, and provide a complete benchmark to facilitate the development of RS multi-source cross-modal retrieval. The code of the MCRN method and the proposed datasets are openly accessible at [Link].

The main contributions are as follows:

• We construct a visual-based multi-source cross-modal retrieval network that unifies RS retrieval tasks under multiple retrieval sources.

• To address semantic heterogeneity among multiple data sources, we propose a shared pattern transfer module based on pattern memorizers, combined with generative adversarial learning, to obtain semantic representations unbound from modality.

• To cope with the lack of annotated data in RS scenes, we construct a unified unimodal self-supervised pre-training method and align semantics across modalities through the constructed multimodal RS datasets.
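As a minimal illustration of the multitask objective named in the abstract, the sketch below combines a multi-source triplet loss, a unimodal contrastive loss, and a semantic consistency loss. This is not the authors' implementation: the margin, temperature, weighting terms (alpha, beta), and the batch keys are all hypothetical, and standard formulations (cosine-distance triplet, InfoNCE, MSE consistency) are assumed.

```python
# Hedged sketch of a multitask semantic-alignment objective (assumptions noted above).
import torch
import torch.nn.functional as F

def multi_source_triplet_loss(anchor, pos_vis, neg_vis, margin=0.2):
    """Pull a query embedding toward its paired visual embedding (the visual-based
    modal center) and push it away from a mismatched one."""
    pos_dist = 1.0 - F.cosine_similarity(anchor, pos_vis)
    neg_dist = 1.0 - F.cosine_similarity(anchor, neg_vis)
    return F.relu(pos_dist - neg_dist + margin).mean()

def unimodal_contrastive_loss(z1, z2, temperature=0.07):
    """InfoNCE-style loss between two views of the same modality (self-supervision)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def semantic_consistency_loss(shared_a, shared_b):
    """Encourage the shared (modality-unbound) representations of paired samples to agree."""
    return F.mse_loss(shared_a, shared_b)

def total_loss(batch, alpha=1.0, beta=1.0):
    """Weighted multitask objective; alpha and beta are illustrative weights."""
    l_tri = multi_source_triplet_loss(batch["query"], batch["vis_pos"], batch["vis_neg"])
    l_con = unimodal_contrastive_loss(batch["view1"], batch["view2"])
    l_sem = semantic_consistency_loss(batch["shared_a"], batch["shared_b"])
    return l_tri + alpha * l_con + beta * l_sem
```

In this sketch all embeddings are assumed to be 2-D tensors of shape (batch, dim); in practice the weights alpha and beta would be tuned so that no single task dominates the alignment training.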
