Abstract

Cross-modal remote sensing (RS) image-text retrieval (CMRSITR) plays a crucial role in the RS community. A common approach to CMRSITR is to extract feature representations for texts and RS images separately and then measure their similarities in a specific or common feature space. Recently, with the rapid development of deep convolutional neural networks (DCNNs), such methods have flourished and achieved success in their respective applications. However, they neglect the inherent relationships between different features, and the resulting models are often heavy. To overcome these limitations, we propose a new model for CMRSITR, named the multi-scale interactive transformer (MSIT). MSIT first adopts simple feature learning models for texts and RS images, which keeps the whole model lightweight. Then, MSIT introduces transformer encoders to enhance the usefulness of the features by modeling the potential relations between different representations. In addition, a lightweight multi-scale feature learning module is proposed to mine richer content from RS images. Finally, instead of outputting features, MSIT produces matching scores for text-image pairs, which determine the retrieval results directly. Experimental results on two RS datasets indicate that our model is effective for CMRSITR.
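
The abstract describes the overall pipeline only at a high level; the following PyTorch sketch illustrates the general idea under stated assumptions (module names, dimensions, pooling scales, and the scoring head are illustrative, not the authors' implementation): lightweight per-modality encoders, multi-scale pooling of image features, a transformer encoder over the concatenated tokens for cross-modal interaction, and a head that outputs a matching score.

```python
# Illustrative sketch of the MSIT idea (not the authors' code).
# All module names, dimensions, and scales below are assumptions.
import torch
import torch.nn as nn

class MultiScaleImageEncoder(nn.Module):
    """Lightweight multi-scale feature learner: pools a CNN feature map at several scales."""
    def __init__(self, in_dim=512, out_dim=256, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feat_map):  # feat_map: (B, C, H, W)
        tokens = []
        for s in self.scales:
            pooled = nn.functional.adaptive_avg_pool2d(feat_map, s)  # (B, C, s, s)
            tokens.append(pooled.flatten(2).transpose(1, 2))         # (B, s*s, C)
        return self.proj(torch.cat(tokens, dim=1))                   # (B, sum(s*s), out_dim)

class MSITSketch(nn.Module):
    """Image and text tokens are concatenated and passed through a transformer encoder
    so the two modalities can interact; a small head outputs a matching score."""
    def __init__(self, txt_vocab=10000, dim=256, n_layers=2, n_heads=4):
        super().__init__()
        self.txt_embed = nn.Embedding(txt_vocab, dim)     # simple text feature learner
        self.img_enc = MultiScaleImageEncoder(out_dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.interaction = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score_head = nn.Linear(dim, 1)               # matching score, not a feature

    def forward(self, img_feat_map, text_ids):
        img_tokens = self.img_enc(img_feat_map)           # (B, Ni, dim)
        txt_tokens = self.txt_embed(text_ids)             # (B, Nt, dim)
        fused = self.interaction(torch.cat([img_tokens, txt_tokens], dim=1))
        return self.score_head(fused.mean(dim=1)).squeeze(-1)  # (B,) matching scores

# Usage: score a batch of 2 hypothetical image/text pairs with random inputs.
model = MSITSketch()
scores = model(torch.randn(2, 512, 14, 14), torch.randint(0, 10000, (2, 20)))
print(scores.shape)  # torch.Size([2])
```

The key design point reflected here is that the matching score is produced after the two modalities interact inside the transformer encoder, rather than by comparing independently extracted features in a shared space.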
