Recent Transformer-based methods can generate high-quality captions for remote sensing images (RSIs). However, these methods generally feed global or grid visual features into a Transformer-based captioning model to associate cross-modal information, which limits performance. In this work, we explore a previously uninvestigated direction for remote sensing image captioning: a novel patch-level region-aware module combined with a multi-label framework. Because RSIs are captured from an overhead perspective and cover a significantly larger spatial scale, the patch-level region-aware module is designed to filter redundant information from the RSI scene, which benefits the Transformer-based decoder through improved image perception. Technically, a trainable multi-label classifier provides semantic features that complement the region-aware features. Moreover, modeling the inner relations of the inputs is essential for understanding an RSI. We therefore introduce region-oriented attention, which associates region features with semantic labels, suppresses irrelevant regions to highlight relevant ones, and learns the related semantic information. Extensive qualitative and quantitative experiments demonstrate the superiority of our approach on the RSICD, UCM-Captions, and Sydney-Captions datasets. The code for our method will be made publicly available.
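The abstract does not give the exact formulation of the region-oriented attention, so the following is only a minimal PyTorch sketch of the idea it describes: semantic label embeddings act as queries over patch-level region features, and low-relevance regions are masked out before aggregation. All class names, tensor shapes, and the keep_ratio hyperparameter are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RegionOrientedAttention(nn.Module):
    """Hypothetical sketch: attend over patch-level region features,
    guided by semantic label embeddings, and omit low-relevance regions."""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # queries from semantic label embeddings
        self.k_proj = nn.Linear(dim, dim)   # keys from region features
        self.v_proj = nn.Linear(dim, dim)   # values from region features
        self.keep_ratio = keep_ratio        # fraction of regions kept (assumed hyperparameter)
        self.scale = dim ** -0.5

    def forward(self, region_feats, label_embeds):
        # region_feats: (B, N, D) patch-level region features
        # label_embeds: (B, L, D) embeddings of predicted multi-label classes
        q = self.q_proj(label_embeds)
        k = self.k_proj(region_feats)
        v = self.v_proj(region_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, L, N)

        # Relevance of each region = attention mass it receives across all labels.
        relevance = attn.sum(dim=1)                                          # (B, N)
        k_keep = max(1, int(self.keep_ratio * region_feats.size(1)))
        top_idx = relevance.topk(k_keep, dim=-1).indices                     # indices of relevant regions

        # Mask out (omit) low-relevance regions, then re-normalize attention.
        mask = torch.zeros_like(relevance).scatter_(1, top_idx, 1.0)         # (B, N)
        attn = attn * mask.unsqueeze(1)
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return attn @ v                                                      # (B, L, D) label-aware region context


# Minimal usage example with random tensors (shapes are illustrative only).
if __name__ == "__main__":
    roa = RegionOrientedAttention(dim=256)
    regions = torch.randn(2, 49, 256)   # e.g. a 7x7 patch grid
    labels = torch.randn(2, 5, 256)     # e.g. 5 predicted semantic labels
    out = roa(regions, labels)
    print(out.shape)                    # torch.Size([2, 5, 256])
```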