Abstract

Semantic segmentation of remotely sensed urban scene images is in wide demand for applications such as land cover mapping, urban change detection, and environmental protection. With the development of deep learning, methods based on convolutional neural networks (CNNs) have become dominant owing to their powerful ability to represent hierarchical feature information. However, the intrinsic locality of the convolution operation limits the network's ability to extract global contextual information. Following the recent success of transformers in computer vision, the transformer architecture has shown great potential for modeling global context, yet it remains insufficient at capturing local detail. In this article, to explore the potential of a joint CNN and transformer mechanism for semantic segmentation of remotely sensed urban scenes, we propose a CNN and transformer multiscale fusion network (CTMFNet) based on an encoder–decoder design for urban scene understanding. To couple local and global context information more efficiently, we design a dual backbone attention fusion module (DAFM) that fuses the local and global context information from the dual-branch encoder. In addition, to bridge the semantic gap between scales, we build a multi-layer dense connectivity network (MDCN) as our decoder. The MDCN enables semantic information to flow fully between multiple scales and be fused through upsampling and residual connectivity. We conducted extensive subjective and objective comparison experiments and ablation studies on both the International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and ISPRS Potsdam datasets. The experimental results demonstrate the superiority of our method over currently popular methods.
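The abstract does not include implementation details of the DAFM, so the following is only a minimal sketch of the general idea it describes: fusing a CNN (local) feature map with a transformer (global) feature map at the same scale using channel attention. The class name `DualBranchFusion` and all parameter choices here are hypothetical illustrations, not the authors' module.

```python
import torch
import torch.nn as nn


class DualBranchFusion(nn.Module):
    """Illustrative coupling of local (CNN) and global (transformer) features.

    NOTE: this is an assumption-based sketch, not the paper's DAFM.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze spatial dims, then re-weight each channel.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # 1x1 projection back to the branch channel width.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, cnn_feat: torch.Tensor, trans_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W) at the same spatial scale.
        fused = torch.cat([cnn_feat, trans_feat], dim=1)  # (B, 2C, H, W)
        fused = fused * self.attn(fused)                  # channel re-weighting
        return self.project(fused)                        # (B, C, H, W)


if __name__ == "__main__":
    local_feat = torch.randn(2, 64, 32, 32)   # e.g., from a CNN stage
    global_feat = torch.randn(2, 64, 32, 32)  # e.g., from a transformer stage, reshaped
    out = DualBranchFusion(64)(local_feat, global_feat)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

A decoder in the spirit of the described MDCN would then upsample such fused features stage by stage and combine them across scales with residual connections; those details are likewise left to the full text.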
