Abstract

Owing to variations in spatial resolution and the wide range of object scales, interpreting optical remote sensing images is extremely challenging. Deep learning has become the mainstream approach for interpreting such complex scenes. However, the proliferation of deep learning architectures has created a need for hundreds of millions of remote sensing images, whose labels are costly to obtain and often not publicly available. This paper provides an in-depth analysis of the main causes of this data hunger, namely (i) limited representational power for model learning, and (ii) underutilization of unlabeled remote sensing data. To overcome these difficulties, we present a scalable and adaptive self-supervised Transformer (AST) for optical remote sensing image interpretation. By performing masked image modeling during pre-training, the proposed AST unlocks the rich supervision signals latent in massive unlabeled remote sensing data and learns useful multi-scale semantics. Specifically, a cross-scale Transformer architecture with a pyramid structure is designed to jointly learn global dependencies and local details, facilitating multi-granular feature interactions and producing scale-invariant representations. Furthermore, a masking token strategy based on correlation mapping is proposed to adaptively mask a subset of patches without disrupting key structures, which enhances the understanding of visually important regions. Extensive experiments on a variety of optical remote sensing interpretation tasks show that AST generalizes well and achieves competitive performance.
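To make the adaptive masking idea concrete, the sketch below illustrates one plausible reading of correlation-guided patch masking: each patch is scored by its correlation with a global image representation, and the least-correlated patches are masked so that visually important regions tend to stay visible during masked image modeling. This is a hypothetical illustration, not the authors' implementation; the function name, the use of the mean token as the correlation reference, and the mask ratio are all assumptions made for this example.

```python
import torch


def adaptive_mask(patch_tokens: torch.Tensor, mask_ratio: float = 0.6) -> torch.Tensor:
    """Illustrative correlation-guided masking (assumption, not the paper's exact algorithm).

    patch_tokens: (B, N, D) patch embeddings from the encoder's patch projection.
    Returns a boolean mask of shape (B, N); True marks patches to be masked.
    """
    B, N, _ = patch_tokens.shape

    # Score each patch by its correlation (cosine similarity) with the
    # global mean token, used here as a stand-in importance measure.
    tokens = torch.nn.functional.normalize(patch_tokens, dim=-1)
    global_repr = tokens.mean(dim=1, keepdim=True)          # (B, 1, D)
    scores = (tokens * global_repr).sum(dim=-1)             # (B, N)

    # Mask the least-correlated patches first, so key structures
    # (high-correlation regions) are more likely to remain visible.
    num_masked = int(mask_ratio * N)
    order = scores.argsort(dim=1)                            # ascending by score
    mask = torch.zeros(B, N, dtype=torch.bool, device=patch_tokens.device)
    mask.scatter_(1, order[:, :num_masked], True)
    return mask


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)              # e.g. 14x14 patches, ViT-Base width
    m = adaptive_mask(x, mask_ratio=0.6)
    print(m.shape, m.float().mean().item())   # torch.Size([2, 196]), ~0.6
```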
