Abstract

In modern remote sensing image change detection (CD), convolutional neural networks (CNNs), especially U-shaped structures (UNet), have achieved great success due to their powerful discriminative ability. However, UNet-based CNNs are usually limited in modeling global dependencies because of the intrinsic locality of convolution operations. The Transformer has recently emerged as an alternative architecture for dense prediction tasks thanks to its global self-attention mechanism. However, owing to hardware resource limitations, pure Transformer methods generally cannot capture global information at low levels. To address these problems, we propose STransUNet, which combines the Transformer and the UNet architecture. STransUNet can not only capture shallow detail features at an early stage, but also model global context in high-level features. In addition, we design an efficient feature fusion module named Cross-Enhanced Adaptive Fusion (CEAF). Our model consists of three parts: an encoder, a fusion module and a decoder. The encoder is a CNN-Transformer hybrid structure: the CNN extracts multi-level feature information, and the Transformer encodes the tokenized sequence to capture global context. The CEAF module cross-enhances and adaptively fuses bi-temporal features to strengthen the feature representation. In the decoding stage, we introduce a Cascaded Upsampling decoder (CUP), which progressively aggregates low-level CNN features and high-level Transformer features up to full resolution. On four public CD datasets, our STransUNet achieves better CD results than six state-of-the-art algorithms.
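The abstract does not spell out CEAF's equations. As a rough illustration only, the sketch below shows one plausible "cross-enhance then adaptively weight" scheme for fusing bi-temporal feature maps; the function name, the sigmoid gating, and the softmax weighting are all assumptions, not the paper's actual formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ceaf_fusion(f1, f2):
    """Hypothetical cross-enhanced adaptive fusion of two temporal
    feature maps f1, f2 of identical shape (C, H, W).

    NOTE: this is an illustrative assumption, not the CEAF module
    as defined in the paper.
    """
    # Cross-enhancement: gate each feature map with an attention map
    # derived from the other temporal branch (assumed design).
    e1 = f1 * sigmoid(f2)
    e2 = f2 * sigmoid(f1)
    # Adaptive fusion: derive one scalar importance per branch via
    # global average pooling, normalize with a softmax (simplified).
    w = np.array([e1.mean(), e2.mean()])
    a = np.exp(w) / np.exp(w).sum()
    return a[0] * e1 + a[1] * e2
```

The fused map keeps the input shape, so it can be passed on to a decoder stage unchanged.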
