A Cross-Attention Mechanism Based on Regional-Level Semantic Features of Images for Cross-Modal Text-Image Retrieval in Remote Sensing

Fuzhong Zheng,Weipeng Li,Xiong Zhang,Xu Wang,Haisu Zhang,Luyao Wang

doi:10.3390/app122312221

Fuzhong Zheng, Weipeng Li + Show 4 more

Open Access

https://doi.org/10.3390/app122312221

Copy DOI

Journal: Applied sciences	Publication Date: Nov 29, 2022
Citations: 5	License type: CC BY 4.0

Affiliation: National University of Defense Technology

Abstract

With the rapid development of remote sensing (RS) observation technology over recent years, the high-level semantic association-based cross-modal retrieval of RS images has drawn some attention. However, few existing studies on cross-modal retrieval of RS images have addressed the issue of mutual interference between semantic features of images caused by “multi-scene semantics”. Therefore, we proposed a novel cross-attention (CA) model, called CABIR, based on regional-level semantic features of RS images for cross-modal text-image retrieval. This technique utilizes the CA mechanism to implement cross-modal information interaction and guides the network with textual semantics to allocate weights and filter redundant features for image regions, reducing the effect of irrelevant scene semantics on retrieval. Furthermore, we proposed BERT plus Bi-GRU, a new approach to generating statement-level textual features, and designed an effective temperature control function to steer the CA network toward smooth running. Our experiment suggested that CABIR not only outperforms other state-of-the-art cross-modal image retrieval methods but also demonstrates high generalization ability and stability, with an average recall rate of up to 18.12%, 48.30%, and 55.53% over the datasets RSICD, UCM, and Sydney, respectively. The model proposed in this paper will be able to provide a possible solution to the problem of mutual interference of RS images with “multi-scene semantics” due to complex terrain objects.

Full Text