Abstract
Applied research on remote sensing imagery has been driven forward by convolutional neural networks (CNNs). Because of its fixed receptive field, however, a CNN cannot model global semantic relevance. Transformer-based models built on self-attention can model global semantic information, but the patch-based way in which the Transformer computes self-attention ignores the spatial information inside each patch. To address these issues, we propose STransFuse, a new semantic segmentation model for remote sensing images. It combines the strengths of the Transformer and the CNN to improve segmentation quality across diverse remote sensing images. Unlike earlier Transformer-fusion techniques, we employ a staged model to extract coarse-grained and fine-grained feature representations at multiple semantic scales. To take full advantage of the features acquired at different stages, we design an adaptive fusion module that uses a self-attention mechanism to adaptively fuse semantic information across features at different scales. The overall accuracy (OA) of our proposed model is 1.36% higher than the baseline on the Vaihingen dataset and 1.27% higher than the baseline on the Potsdam dataset. Compared with other advanced models, STransFuse performs admirably.
Highlights
Semantic segmentation of remote sensing images, a pixel-level classification task, is an essential problem in remote sensing research
Many evaluation metrics are computed from the confusion matrix; before presenting the formula for each metric, the following symbols of the confusion matrix are defined: true positive (TP), true negative (TN), false positive (FP), and false negative (FN)
Because the Transformer performs its semantic computation via self-attention, Transformer-based models tend to have a large number of parameters; our STransFuse model balances parameter count against experimental performance
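To illustrate how metrics such as OA follow from the confusion-matrix symbols above, here is a minimal sketch; the OA and F1 formulas are standard, but the function names and example counts are ours, not from the paper:

```python
def overall_accuracy(tp, tn, fp, fn):
    """Overall accuracy (OA): fraction of correctly classified pixels."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical pixel counts for one class of a segmentation map.
print(overall_accuracy(tp=80, tn=10, fp=5, fn=5))  # 0.9
```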
Summary
Semantic segmentation of remote sensing images, a pixel-level classification task, is an essential problem in remote sensing research (Gao et al., "STransFuse: Fusing Swin Transformer and CNN for Remote Sensing Image Semantic Segmentation"). Inspired by the U-Net network [9], we fuse the feature maps of different stages to obtain both the semantic contextual information and the spatial contextual information of the images. To this end, we propose STransFuse, a model for semantic segmentation of remote sensing images. Inspired by [14], we use a ResNet with pretrained weights as the backbone of the CNN branch, combined with a Swin Transformer, to obtain rich feature information from remote sensing images.
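The stage-wise fusion described above could be sketched as follows. This is a minimal NumPy illustration of fusing a CNN feature map with a Transformer feature map via a single self-attention step, not the paper's actual adaptive fusion module; all shapes, names, and the identity Q/K/V projections are our simplifying assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(cnn_feat, trans_feat):
    """Fuse two (N, C) feature maps with single-head self-attention
    over their concatenated tokens. N is the number of flattened
    spatial positions, C the channel dimension (illustrative only)."""
    x = np.concatenate([cnn_feat, trans_feat], axis=0)  # (2N, C)
    c = x.shape[1]
    # In a trained model Q, K, V would be learned linear projections;
    # identity projections are used here for simplicity.
    q, k, v = x, x, x
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)       # (2N, 2N)
    fused = attn @ v                                    # (2N, C)
    # Average the two halves back into a single (N, C) fused map.
    n = cnn_feat.shape[0]
    return 0.5 * (fused[:n] + fused[n:])

# Toy example: 16 spatial positions, 8 channels per branch.
cnn = np.random.rand(16, 8)
swin = np.random.rand(16, 8)
out = attention_fuse(cnn, swin)
print(out.shape)  # (16, 8)
```

Each fused token is a weighted mixture of tokens from both branches, which is what lets attention-based fusion exchange information across scales.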
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing