Abstract

Applied research in remote sensing imagery has been driven by the convolutional neural network (CNN). Because of its fixed receptive field, a CNN cannot model global semantic relevance. Transformer-based models built on self-attention can model global semantic information; however, the patch-based computation the Transformer uses for self-attention ignores the spatial information inside each patch. To address these issues, we propose STransFuse, a new semantic segmentation model for remote sensing images that combines the strengths of the Transformer and the CNN to improve segmentation quality across diverse remote sensing images. Unlike earlier Transformer-fusion approaches, we employ a staged model to extract coarse-grained and fine-grained feature representations at multiple semantic scales. To take full advantage of the features acquired at different stages, we design an adaptive fusion module (AFM) that uses a self-attention mechanism to adaptively fuse semantic information between features at different scales. The overall accuracy (OA) of our proposed model is 1.36% higher than the baseline on the Vaihingen dataset and 1.27% higher on the Potsdam dataset. Compared with other advanced models, STransFuse performs competitively.
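
The abstract describes the adaptive fusion module only at a high level. Below is a minimal PyTorch sketch of what self-attention-based fusion of a fine-grained and a coarse-grained feature map could look like; the module name, tensor shapes, and wiring are our assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a self-attention-based adaptive fusion module (AFM).
# Names, shapes, and design details are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, C, H, W) fine-grained features (e.g., from a CNN branch)
        # coarse: (B, C, h, w) coarse-grained features (e.g., from a Transformer branch)
        B, C, H, W = fine.shape
        # Upsample the coarse map to the fine resolution before fusing.
        coarse = F.interpolate(coarse, size=(H, W), mode="bilinear", align_corners=False)
        q = fine.flatten(2).transpose(1, 2)     # (B, H*W, C): queries from fine features
        kv = coarse.flatten(2).transpose(1, 2)  # (B, H*W, C): keys/values from coarse features
        fused, _ = self.attn(q, kv, kv)         # cross-scale attention
        fused = self.norm(fused + q)            # residual connection
        return fused.transpose(1, 2).reshape(B, C, H, W)
```

Using the fine features as queries keeps the output at the fine branch's spatial resolution while letting every position attend to the coarse branch's global context.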

Highlights

  • Semantic segmentation of remote sensing images, a pixel-level classification task, is an essential problem in remote sensing research

  • Many evaluation metrics are computed from the confusion matrix, whose entries are true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); a sketch of how overall accuracy is derived from these counts follows this list

  • Because the Transformer relies on self-attention for semantic computation, Transformer-based models tend to have large parameter counts; the proposed STransFuse model balances parameter count against experimental performance
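
As promised above, here is a minimal NumPy sketch of computing the overall accuracy (OA) from confusion-matrix counts; the helper names and array layout are our choices, not code from the paper.

```python
# Minimal sketch: overall accuracy (OA) from a confusion matrix.
# cm[i, j] counts pixels whose true class is i and predicted class is j,
# so the diagonal holds the per-class TP counts.
import numpy as np

def confusion_matrix(pred: np.ndarray, label: np.ndarray, num_classes: int) -> np.ndarray:
    idx = label.astype(int) * num_classes + pred.astype(int)
    return np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(num_classes, num_classes)

def overall_accuracy(cm: np.ndarray) -> float:
    # OA = correctly classified pixels / all pixels = trace(cm) / sum(cm)
    return np.trace(cm) / cm.sum()

# Usage: cm = confusion_matrix(pred_map, label_map, num_classes=6)
#        oa = overall_accuracy(cm)
```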


Summary

INTRODUCTION

Semantic segmentation of remote sensing images, a pixel-level classification task, is an essential problem in remote sensing research. Inspired by the U-Net network [9], we fuse feature maps from different stages to obtain both the semantic and the spatial contextual information of the images. To this end, we propose STransFuse, a model for semantic segmentation of remote sensing images. Following [14], we use a ResNet with pretrained weights as the backbone of the CNN branch and combine it with the Swin Transformer to obtain rich feature information from remote sensing images.
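
To make the dual-branch idea concrete, here is a hypothetical PyTorch skeleton pairing a pretrained ResNet with a Swin Transformer branch and exposing per-stage feature maps for U-Net-style fusion; the class name, stage tapping, and the `swin_branch` interface are assumptions for illustration, not the authors' exact architecture.

```python
# Hypothetical skeleton of the dual-branch encoder described above.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class DualBranchEncoder(nn.Module):
    def __init__(self, swin_branch: nn.Module):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        # Expose the ResNet stages so feature maps can be tapped per stage,
        # as in U-Net-style designs that fuse multi-stage features.
        self.stem = nn.Sequential(cnn.conv1, cnn.bn1, cnn.relu, cnn.maxpool)
        self.stages = nn.ModuleList([cnn.layer1, cnn.layer2, cnn.layer3, cnn.layer4])
        # swin_branch is assumed to return one feature map per stage as well.
        self.swin = swin_branch

    def forward(self, x: torch.Tensor):
        cnn_feats = []
        y = self.stem(x)
        for stage in self.stages:
            y = stage(y)
            cnn_feats.append(y)        # fine-grained CNN features per stage
        swin_feats = self.swin(x)      # coarse-grained Transformer features
        # Downstream, each (cnn_feat, swin_feat) pair would be fused by an
        # adaptive fusion module such as the one sketched earlier.
        return cnn_feats, swin_feats
```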

Semantic Segmentation of Remote Sensing Images
Contextual Information
Transformer
Overview
STransFuse Overall Architecture
Swin Transformer Block
AFM Block
Dataset
Evaluation Metric
Training Configuration
Ablation Studies
Visualization Analysis
Window Size Impact Analysis
Confusion Matrix
Evaluation and Comparisons on the Vaihingen Dataset
Evaluation and Comparisons on the Potsdam Dataset
Comparison of the Efficiency of State-of-the-Art Models in Different Datasets
Findings
CONCLUSION