Remote sensing cross-modal text-image retrieval is a key component of multi-modal retrieval in remote sensing, and its core challenge is learning joint visual and textual representations. Prior work has predominantly emphasized the global characteristics of remote sensing images or employed attention mechanisms for fine-grained alignment, but has, to some degree, overlooked the complexity of the textual descriptions that accompany remote sensing images. In this paper, we introduce a novel cross-modal retrieval model tailored to remote sensing image-text data, built on attention correction and attention filtering mechanisms. The model comprises four components: an image feature extraction module, a text feature extraction module, an attention correction module, and an attention filtering module. The image feature extraction module uses the Visual Graph Neural Network (VIG) as its principal encoder, augmented with multi-level node feature fusion, to capture a comprehensive representation of remote sensing images. The text feature extraction module employs both a Bidirectional Gated Recurrent Unit (BGRU) and a Graph Attention Network (GAT) as encoders, giving the model a richer understanding of the accompanying text. The attention correction module reduces misalignments in image-text pairs by adjusting attention weights when a one-to-one correspondence exists between an image region's features and a textual descriptor. The attention filtering module, in turn, suppresses the influence of irrelevant visual regions and words during image-text matching, improving the precision of cross-modal retrieval. Extensive experiments on the RSICD and RSITMD datasets yield strong results, demonstrating the effectiveness of the proposed method for remote sensing cross-modal text-image retrieval.
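To make the matching stage concrete, the following is a minimal PyTorch sketch of attention-based image-text scoring with a correction step and a filtering step. It is an illustration under stated assumptions, not the paper's actual formulation: the function name `crossmodal_similarity`, the hyperparameters `tau` and `keep_ratio`, and the specific rules used here (sharpening a word's attention toward one region when that region dominates; keeping only the top-attended regions and renormalizing) are all hypothetical stand-ins for the correction and filtering modules described above.

```python
import torch
import torch.nn.functional as F

def crossmodal_similarity(regions, words, tau=0.1, keep_ratio=0.5):
    """Hedged sketch of attention-based image-text matching.

    regions: (R, D) visual region features from the image encoder.
    words:   (W, D) word features from the text encoder.
    tau, keep_ratio: hypothetical hyperparameters, not from the paper.
    Returns a scalar image-text similarity score.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    sim = words @ regions.T                      # (W, R) word-region affinities
    attn = F.softmax(sim / tau, dim=-1)          # each word attends over regions

    # Attention correction: where one region dominates a word's attention
    # (a near one-to-one correspondence), sharpen the distribution toward
    # that region instead of leaving weight spread across the image.
    peak, idx = attn.max(dim=-1)
    onehot = F.one_hot(idx, attn.size(-1)).float()
    attn = torch.where(peak.unsqueeze(-1) > 0.5, onehot, attn)

    # Attention filtering: keep only the most-attended regions per word
    # and renormalize, so irrelevant visual areas do not dilute the score.
    k = max(1, int(keep_ratio * attn.size(-1)))
    topk, top_idx = attn.topk(k, dim=-1)
    filtered = torch.zeros_like(attn).scatter_(-1, top_idx, topk)
    filtered = filtered / filtered.sum(dim=-1, keepdim=True)

    # Attended visual context per word; cosine similarity pooled over
    # words yields the final image-text matching score.
    context = filtered @ regions                 # (W, D)
    return F.cosine_similarity(context, words, dim=-1).mean()

# Example usage with random features: 36 image regions, 8 words, dim 512.
score = crossmodal_similarity(torch.randn(36, 512), torch.randn(8, 512))
```

A filtering step of this kind is most useful when captions contain function words or when large background areas of a scene carry little descriptive content, since both would otherwise receive nonzero attention and blur the match.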