Abstract

Fully convolutional networks (FCNs) built on an encoder-decoder structure have become the standard architecture in semantic segmentation. The encoder-decoder design is an effective means of obtaining fine-grained predictions: the encoder progressively extracts multi-level features, and the decoder then gradually reintroduces low-level features into the high-level ones. Context information is critical for accurate segmentation and is a main research direction in semantic segmentation at present. Many efforts have been made to exploit this information better, including encoder-decoder structures, dilated (atrous) convolution, and attention mechanisms. However, most of these schemes are built on ResNet or other convolutional variants of the FCN, so they cannot escape the limited local receptive field inherent to convolution. In this work, we introduce the Pyramid Vision Transformer (PVT) to replace the traditional fully convolutional backbone, and we design a novel encoder-decoder architecture that exploits context information more effectively.
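To make the encoder-decoder idea described above concrete, the following is a minimal PyTorch sketch of a segmentation model whose decoder progressively fuses low-level encoder features into high-level ones. It is not the authors' implementation: the encoder is assumed to be any PVT-style hierarchical backbone returning four feature maps at strides 4/8/16/32, and the class names, channel widths, and fusion scheme are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Upsample the high-level map and merge it with a lower-level skip feature."""
    def __init__(self, high_ch, skip_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(high_ch + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, high, skip):
        high = F.interpolate(high, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        return self.fuse(torch.cat([high, skip], dim=1))

class EncoderDecoderSeg(nn.Module):
    # enc_channels defaults mirror a PVT-small-style backbone (illustrative).
    def __init__(self, encoder, enc_channels=(64, 128, 320, 512), num_classes=19):
        super().__init__()
        self.encoder = encoder  # assumed to return [c1, c2, c3, c4]
        c1, c2, c3, c4 = enc_channels
        self.dec3 = DecoderBlock(c4, c3, 256)
        self.dec2 = DecoderBlock(256, c2, 128)
        self.dec1 = DecoderBlock(128, c1, 64)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        c1, c2, c3, c4 = self.encoder(x)  # multi-level features, strides 4..32
        y = self.dec3(c4, c3)             # stride 32 -> 16
        y = self.dec2(y, c2)              # stride 16 -> 8
        y = self.dec1(y, c1)              # stride 8  -> 4
        y = self.head(y)                  # per-pixel class logits at stride 4
        return F.interpolate(y, scale_factor=4, mode="bilinear",
                             align_corners=False)

Each DecoderBlock call corresponds to one fusion step in the abstract's description, trading spatial resolution back in as the decoder walks down the feature pyramid.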
