Abstract

The semantic segmentation of remote sensing images requires distinguishing local regions of different classes and exploiting a uniform global representation of same-class instances. Such requirements make it necessary for segmentation methods to extract discriminative local features between different classes and to explore representative features for all instances of a given class. While common deep convolutional neural networks (DCNNs) can effectively focus on local features, their limited receptive fields prevent them from obtaining consistent global information. In this paper, we propose a memory-augmented transformer (MAT) to effectively model both local and global information. The feature extraction pipeline of the MAT is split into a memory-based global relationship guidance module and a local feature extraction module. The local feature extraction module mainly consists of a transformer, which extracts features from the input images. The global relationship guidance module maintains a memory bank for the consistent encoding of global information. Global guidance is performed by memory interaction. Bidirectional information flow between the global and local branches is realized by a memory-query module and a memory-update module, respectively. Experimental results on the ISPRS Potsdam and ISPRS Vaihingen datasets demonstrate that our method performs competitively with state-of-the-art methods.
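The abstract describes the memory interaction only at a high level. The following PyTorch code is a minimal sketch, under our own assumptions, of how a learnable memory bank, a memory-query step (local tokens reading global context from the memory), and a memory-update step (the memory reading back from the local tokens) could be wired around a transformer block. The class name MemoryAugmentedBlock, the token counts, and the attention layout are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of a memory-query / memory-update pair around a
    # transformer block. All names and hyperparameters are assumptions.
    import torch
    import torch.nn as nn

    class MemoryAugmentedBlock(nn.Module):
        def __init__(self, dim=256, num_heads=8, num_memory_tokens=64):
            super().__init__()
            # Learnable memory bank shared across images (the global prior).
            self.memory = nn.Parameter(torch.randn(num_memory_tokens, dim) * 0.02)
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Memory-query: patch tokens attend to the memory bank.
            self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Memory-update: memory tokens attend to the patch tokens.
            self.update_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        def forward(self, x):
            # x: (batch, num_patches, dim) local patch tokens.
            mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
            # Local feature extraction via self-attention.
            x = x + self.self_attn(x, x, x, need_weights=False)[0]
            # Memory-query: inject consistent global guidance into local tokens.
            x = x + self.query_attn(self.norm1(x), mem, mem, need_weights=False)[0]
            # Memory-update: write the current image's features back into the
            # memory so later blocks see an image-conditioned global context.
            mem = mem + self.update_attn(mem, x, x, need_weights=False)[0]
            x = x + self.mlp(self.norm2(x))
            return x, mem

    # Example: tokens from a 16x16 patch grid of a remote sensing tile.
    feats, mem = MemoryAugmentedBlock()(torch.randn(2, 256, 256))
    print(feats.shape, mem.shape)  # (2, 256, 256) and (2, 64, 256)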

Highlights

  • Semantic segmentation of high-resolution remote sensing images [1,2,3,4] is an important application scenario in remote sensing image interpretation and is widely used in land mapping, environmental monitoring, urban construction, etc.

  • The experimental results demonstrate that the memory-augmented transformer (MAT), which exploits both prior information and the input-based representation, performs well on high-resolution remote sensing image semantic segmentation tasks.

  • Prior information is added to the network via learnable memory tokens.

Introduction

Semantic segmentation of high-resolution remote sensing images [1,2,3,4] is an important application scenario in remote sensing image interpretation and is widely used in land mapping, environmental monitoring, urban construction, etc. Traditional segmentation methods mainly depend on low-level features such as color, edges, shapes, and spatial locations, and use heuristic methods such as clustering or thresholding to translate these features into the final segmentation masks. Due to the limited representation power of low-level features and the overtuned parameters of the clustering methods, the performance of these methods is far from satisfactory. The emergence of deep convolutional neural networks (DCNNs) has provided more powerful representation abilities and boosted the performance of remote sensing image recognition. DCNNs [7,8] take the remote sensing image as the input and directly map it into the desired output (classes, object boxes, and masks). In the remote sensing image semantic segmentation field, many works [9,10] using convolutional neural networks have been proposed to tackle the problem. Their segmentation results are better than those of traditional methods thanks to the deep layers and the end-to-end training paradigm.
