Abstract

Although convolutional neural networks (CNNs) show promise in image semantic segmentation, they are limited in capturing global context, which leads to inaccurate segmentation of small objects and object boundaries. This study introduces a hybrid network, ICTANet, which combines convolutional and Transformer architectures to improve segmentation of fine-resolution urban remote sensing imagery. ICTANet is essentially a Transformer-based encoder-decoder. Its dual-encoder architecture, combining CNN and Swin Transformer modules, is designed to extract both global and local detail information. Feature Extraction and Fusion (FEF) modules collect feature information at the various stages, enabling multi-scale contextual fusion. In addition, an Auxiliary Boundary Detection (ABD) module at the end of the decoder strengthens the model's ability to capture object boundary information. Extensive ablation experiments demonstrate the efficacy of the individual components of the network. Test results show that the proposed model achieves satisfactory performance on the ISPRS Vaihingen and Potsdam datasets, with overall accuracies of 91.9% and 92.0%, respectively. Compared with current state-of-the-art methods, the proposed model also exhibits competitive performance, particularly in segmenting small objects such as cars and trees.
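The dual-encoder idea described above can be illustrated with a minimal, purely conceptual sketch. The code below is not the authors' implementation; all function names and the dict-based representation of feature maps are hypothetical. It only shows the shape bookkeeping an FEF-style module would perform: a CNN branch and a Transformer branch produce feature maps at the same spatial resolution, which are then fused channel-wise across encoder stages.

```python
# Toy sketch of dual-encoder feature fusion (hypothetical names; not the
# ICTANet source). A feature map is represented only by its dimensions:
# {"channels": C, "height": H, "width": W}.

def fuse_features(cnn_feat, transformer_feat):
    """Fuse a local (CNN) feature map with a global (Swin Transformer)
    feature map by channel concatenation, as an FEF-style module might.
    Both inputs must share the same spatial resolution."""
    assert cnn_feat["height"] == transformer_feat["height"]
    assert cnn_feat["width"] == transformer_feat["width"]
    return {
        "channels": cnn_feat["channels"] + transformer_feat["channels"],
        "height": cnn_feat["height"],
        "width": cnn_feat["width"],
    }

def multi_scale_fusion(stage_pairs):
    """Apply fusion at every encoder stage; each pair is
    (cnn_feature_map, transformer_feature_map) at one scale."""
    return [fuse_features(c, t) for c, t in stage_pairs]

if __name__ == "__main__":
    # Two encoder stages; spatial size halves between stages.
    stages = [
        ({"channels": 64, "height": 64, "width": 64},
         {"channels": 96, "height": 64, "width": 64}),
        ({"channels": 128, "height": 32, "width": 32},
         {"channels": 192, "height": 32, "width": 32}),
    ]
    for fused in multi_scale_fusion(stages):
        print(fused["channels"], fused["height"], fused["width"])
```

In a real network the fusion step would also involve convolutions or attention over the concatenated features; the sketch deliberately tracks only the dimensional invariant (matching spatial sizes, summed channel counts) that any such fusion must respect.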
