Remote sensing (RS) cross-modal text–image retrieval has great application value in both military and civilian fields. Existing methods use deep networks to project images and texts into a common space and measure their similarity. However, most of these methods exploit only the inter-modality information between images and texts, ignoring the rich semantic information within each modality. In addition, because RS images are complex, the representations extracted from the original features contain considerable interference information. In this paper, we propose a jointly guided deep network for fine-grained cross-modal RS text–image retrieval. First, we capture fine-grained semantic information within each modality and use it to guide representation learning in the other modality, making full use of both intra- and inter-modality information. Second, to filter out the interference information in the representations extracted from the two modalities, we propose an interference filtration module based on a gating mechanism. Experimental results show significant improvements on retrieval tasks compared with state-of-the-art algorithms. The source code is available at https://github.com/CQULab/JGDN.
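To illustrate the general idea of gated interference filtration described above, the following is a minimal sketch, not the released JGDN implementation: the class name `GatedFilter`, the embedding dimension, and the element-wise sigmoid gate are all illustrative assumptions.

```python
# Minimal sketch of a gated filtration layer (names, shapes, and the exact
# gating form are illustrative assumptions, not the official JGDN code).
import torch
import torch.nn as nn


class GatedFilter(nn.Module):
    """Suppress interference in a modality representation with a learned gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)  # produces per-feature gate logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate values lie in (0, 1): features judged as interference are scaled down.
        g = torch.sigmoid(self.gate(x))
        return g * x


# Example: filter batches of 512-d image and text embeddings before matching.
if __name__ == "__main__":
    filt = GatedFilter(dim=512)
    img_emb = torch.randn(8, 512)
    txt_emb = torch.randn(8, 512)
    img_clean, txt_clean = filt(img_emb), filt(txt_emb)
    sim = torch.nn.functional.cosine_similarity(img_clean, txt_clean)
    print(sim.shape)  # torch.Size([8])
```

In this sketch the gate is learned end-to-end with the retrieval objective, so feature dimensions that do not help cross-modal matching tend to receive gate values near zero; the actual module in the paper may differ in structure and inputs.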