Abstract

The classes in remote sensing image datasets are often imbalanced, which lowers semantic segmentation accuracy and yields poor segmentation of scarce-sample categories. To address this, a Transformer encoder with a multi-head self-attention mechanism is integrated into the Deeplabv3+ network; the attention mechanism strengthens the ability to capture global information and thereby improves precision for scarce categories in remote sensing semantic segmentation. The algorithm first uniformly crops high-resolution remote sensing images into low-resolution patches for batch training while reducing information loss. Second, the lightweight MobileNetV2 backbone replaces the Xception feature extraction network of Deeplabv3+. Finally, a Transformer encoder based on multi-head self-attention is connected in series with the atrous spatial pyramid pooling (ASPP) module in the Deeplabv3+ encoder, improving segmentation precision by enhancing feature learning of scarce-category samples in the deep feature information. Experimental results show that the proposed model, called TransDeeplabv3+, achieves 67.67% mIoU and 81.86% mPA on the GID remote sensing dataset; compared with the Deeplabv3+ baseline, mIoU and mPA improve by 9.46% and 8.52%, respectively. TransDeeplabv3+ effectively increases precision by paying greater attention to scarce-category samples and mitigates the drop in segmentation accuracy caused by imbalanced data categories.

Keywords: Semantic segmentation, Deeplabv3+, Transformer, Attention mechanism, Remote sensing image

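As a rough illustration of the encoder pipeline described above, the following PyTorch sketch connects a MobileNetV2 backbone, an ASPP module, and a multi-head self-attention Transformer encoder in series. The layer sizes, dilation rates, and module names here are illustrative assumptions, not the paper's released implementation.

# Minimal sketch of the TransDeeplabv3+ encoder path, assuming a PyTorch
# implementation; all dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2


class ASPP(nn.Module):
    """Atrous spatial pyramid pooling with assumed dilation rates (6, 12, 18)."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        rates = (6, 12, 18)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False) for r in rates]
        )
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.pool(x), size=(h, w), mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))


class TransDeeplabEncoder(nn.Module):
    """MobileNetV2 backbone -> ASPP -> Transformer encoder (multi-head self-attention), in series."""
    def __init__(self, embed_dim=256, num_heads=8, num_layers=2):
        super().__init__()
        # Lightweight MobileNetV2 feature extractor replaces the Xception backbone of Deeplabv3+.
        self.backbone = mobilenet_v2(weights=None).features  # final feature map has 1280 channels
        self.aspp = ASPP(1280, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        feat = self.backbone(x)                    # (B, 1280, H/32, W/32)
        feat = self.aspp(feat)                     # (B, 256, H/32, W/32)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C) tokens for self-attention
        tokens = self.transformer(tokens)          # global context via multi-head self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    model = TransDeeplabEncoder()
    out = model(torch.randn(1, 3, 512, 512))
    print(out.shape)  # torch.Size([1, 256, 16, 16])

In this sketch the Transformer encoder is applied to the flattened ASPP output, so every spatial location attends to all others; a full Deeplabv3+-style decoder would then upsample and fuse these features with a low-level backbone feature map to produce per-pixel class predictions.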