ABSTRACT Remote sensing image acquisition is an essential means of obtaining information about the Earth's surface, yet research on remote sensing images has mainly focused on object detection and image classification. The emergence of remote sensing image captioning (RSIC) enables understanding and inference over remote sensing images and has therefore attracted considerable attention. Challenges remain in RSIC: existing methods rely mostly on grid features, which make it difficult for the model to identify the main targets to describe, so a more effective cross-modal matching method is needed for better text generation. To address these issues, we propose a region-guided transformer. We extract region features to strengthen the model's focus on the main targets and, to compensate for the background information lost during region feature extraction, we introduce environment features as a supplement. To improve the matching between text and image features, we propose a region-guided decoder that sharpens the model's perception of the different features through a weighted cross-attention mechanism, and we inject region-guided information to steer the text-generation process. Extensive experiments demonstrate the effectiveness and superiority of our model.
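To make the fusion idea concrete, the following is a minimal PyTorch sketch of a decoder layer in which caption tokens cross-attend separately to region features and environment features, with a learned weight balancing the two streams. The module names, the sigmoid gating form, and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class WeightedCrossAttentionDecoderLayer(nn.Module):
    """Illustrative sketch: text queries attend to region features (main
    targets) and environment features (background) independently, then the
    two attended contexts are fused with a learned, per-token weight."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.region_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.env_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Assumed gating scheme: a scalar in (0, 1) per token deciding how
        # much to weight region context versus environment context.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, region_feats, env_feats, tgt_mask=None):
        # Masked self-attention over the partially generated caption.
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # Cross-attend to each feature stream separately.
        r, _ = self.region_attn(x, region_feats, region_feats)
        e, _ = self.env_attn(x, env_feats, env_feats)
        # Weighted fusion of the two attended contexts.
        g = self.gate(x)
        x = self.norm2(x + g * r + (1 - g) * e)
        return self.norm3(x + self.ffn(x))


# Toy usage: 2 captions of length 12 against 36 region features and 49 grid
# environment features, all embedded to d_model = 512 (shapes are assumed).
layer = WeightedCrossAttentionDecoderLayer()
tgt = torch.randn(2, 12, 512)
regions = torch.randn(2, 36, 512)
env = torch.randn(2, 49, 512)
causal = torch.triu(torch.ones(12, 12, dtype=torch.bool), diagonal=1)
out = layer(tgt, regions, env, tgt_mask=causal)
print(out.shape)  # torch.Size([2, 12, 512])
```

The separate attention streams keep the main-target and background cues distinct until the gate combines them, which is one plausible reading of how a weighted cross-attention mechanism could modulate the model's perception of the different features.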