Abstract

2D position information of input tokens is essential for transformer-based semantic segmentation models, especially on high-resolution aerial images. However, recent transformer-based segmentation methods record position information with positional encodings, and most positional encoding methods encode only the 1D positions of tokens. We therefore propose a two-dimensional semantic transformer model (2DSegFormer) for semantic segmentation of aerial images. In 2DSegFormer, we design a novel 2D positional attention to accurately record the 2D position information required by the transformer. Furthermore, we design a dilated residual connection and use it in place of the skip connection in the deep stages to obtain a larger receptive field; skip connections are retained in the shallow stages of 2DSegFormer to pass details to the corresponding stages of the decoder. Experimental results on the UAVid, Vaihingen, and AeroScapes datasets demonstrate the effectiveness of 2DSegFormer. Compared with state-of-the-art methods, 2DSegFormer shows better performance and strong robustness across the three datasets. In particular, 2DSegFormer-B2 achieves first place in the public ranking on the UAVid test set.
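To illustrate the 1D-versus-2D distinction the abstract draws, the sketch below builds a generic 2D positional encoding by concatenating sinusoidal embeddings of each token's row and column index. This is not the paper's 2D positional attention (which operates inside the attention mechanism); it is only a minimal, assumed baseline showing how 2D grid positions can be encoded, in contrast to encodings that number tokens along a single flattened 1D sequence.

```python
import numpy as np

def sinusoid_1d(positions, dim):
    """Standard 1D sinusoidal encoding for a vector of integer positions."""
    pe = np.zeros((len(positions), dim))
    div = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

def positional_encoding_2d(h, w, dim):
    """Encode each token on an h x w grid by concatenating a sinusoidal
    encoding of its row index with one of its column index.
    Returns an (h*w, dim) array, tokens in row-major order."""
    assert dim % 4 == 0, "need an even split with even halves"
    rows = sinusoid_1d(np.arange(h, dtype=float), dim // 2)  # (h, dim/2)
    cols = sinusoid_1d(np.arange(w, dtype=float), dim // 2)  # (w, dim/2)
    # Broadcast row codes down each row of tokens and tile column codes
    # across rows, then concatenate along the feature axis.
    return np.concatenate(
        [np.repeat(rows, w, axis=0), np.tile(cols, (h, 1))], axis=1
    )
```

With this scheme, two tokens in the same row share the first half of their encoding and two tokens in the same column share the second half, so the 2D grid position is recoverable; a purely 1D encoding of the flattened sequence loses that structure.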
