Abstract
Semantic segmentation plays an indispensable role in the automatic analysis of remote sensing image data. However, the abundant semantic information and irregular shape patterns in remote sensing images are difficult to exploit, so segmentation that relies only on convolution and single-scale feature maps performs poorly. To achieve better segmentation performance, a multiscale feature pyramid decoder (MFPD) is proposed to fuse image features extracted by a vision transformer (ViT). The decoder employs a novel 2-D-to-3-D transform method to obtain multiscale feature maps that contain rich context information, and it fuses these multiscale feature maps by channel concatenation. Furthermore, a dimension attention module (DAM) is designed to further aggregate the context information of the extracted remote sensing image features. The approach achieves a mean intersection over union (mIoU) of 60.42% on the Gaofen2-CZ dataset and 68.21% on the GID-5 dataset. Experimental results indicate that the comprehensive performance of our approach exceeds that of the compared segmentation methods based on convolutional neural networks (CNNs) and ViT.
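For readers unfamiliar with the reported metric, the following is a minimal sketch of how mIoU is commonly computed from predicted and ground-truth class labels. It is not the authors' evaluation code: the function name `mean_iou`, the flattened label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and the abstract does not state the exact evaluation protocol.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for flattened per-pixel class labels.

    Assumption: classes that appear in neither pred nor target are skipped
    rather than counted as IoU = 0.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct pixel: counts once toward both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # Wrong pixel: enlarges the union of both the predicted
            # class and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. In practice the counts are usually accumulated over the whole test set before dividing.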