Abstract

Analyzing land cover changes with multi-temporal remote sensing (RS) images is crucial for environmental protection and land planning. In this paper, we explore Remote Sensing Image Change Captioning (RSICC), a new task that aims to generate human-like language descriptions of the land cover changes in multi-temporal RS images. We propose a novel Transformer-based RSICC model (RSICCformer). It consists of three main components: 1) a CNN-based feature extractor that produces high-level features of RS image pairs, 2) a dual-branch Transformer encoder that improves the discrimination of change features, and 3) a caption decoder that generates sentences describing the differences. The dual-branch Transformer encoder is a hierarchy of processing stages that capture and recognize multiple changes of interest. Concretely, in each branch of the encoder, the bi-temporal feature differences serve as keys to enhance the image features (queries) of the corresponding temporal image. To support the RSICC task, we build a large-scale dataset named LEVIR-CC, which contains 10,077 pairs of bi-temporal RS images and 50,385 sentences describing the differences between images. We benchmark existing state-of-the-art synthetic image change captioning methods on the LEVIR-CC dataset, and our RSICCformer outperforms previous methods by a significant margin (+4.98% on BLEU-4 and +9.86% on CIDEr-D). The attention visualization results also suggest that our model can focus on changes of interest and ignore irrelevant changes.
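To make the difference-guided attention concrete, below is a minimal PyTorch sketch of one stage of a dual-branch encoder of the kind the abstract describes: bi-temporal feature differences act as keys (and, here, values) in a cross-attention that enhances the features of each temporal image. This is an illustrative sketch, not the paper's implementation; the class names (DifferenceGuidedAttention, DualBranchStage), the use of the difference as values, and the residual/LayerNorm placement are all assumptions.

```python
import torch
import torch.nn as nn

class DifferenceGuidedAttention(nn.Module):
    """Hypothetical sketch: cross-attention where bi-temporal feature
    differences act as keys/values to enhance one temporal image's
    features (queries)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat: torch.Tensor, diff: torch.Tensor) -> torch.Tensor:
        # feat: (B, N, C) tokens from one temporal image (queries)
        # diff: (B, N, C) bi-temporal difference tokens (keys/values)
        enhanced, _ = self.attn(query=feat, key=diff, value=diff)
        # Residual connection plus normalization (assumed placement)
        return self.norm(feat + enhanced)

class DualBranchStage(nn.Module):
    """One stage of an assumed dual-branch encoder: both temporal
    branches are enhanced by the same difference features."""

    def __init__(self, dim: int):
        super().__init__()
        self.branch1 = DifferenceGuidedAttention(dim)
        self.branch2 = DifferenceGuidedAttention(dim)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor):
        diff = f2 - f1  # simple bi-temporal feature difference
        return self.branch1(f1, diff), self.branch2(f2, diff)

# Usage: flattened 7x7 CNN feature maps with 256 channels (shapes assumed).
f1 = torch.randn(2, 49, 256)
f2 = torch.randn(2, 49, 256)
stage = DualBranchStage(256)
f1_enh, f2_enh = stage(f1, f2)
```

Stacking several such stages would give the hierarchical encoder the abstract mentions, with each stage refining which changes the features attend to.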
