Abstract

Deep learning approaches have contributed to the rapid development of Remote Sensing (RS) image interpretation. The most widely used training paradigm is to fine-tune ImageNet pre-trained models on RS data for specified tasks. However, this paradigm suffers from the domain gap between natural and RS scenes and the poor generalization capacity of the resulting RS models, which motivates developing a foundation model with general RS feature representations. Since a large amount of unlabeled RS data is available, self-supervised methods hold greater promise than fully supervised ones in remote sensing. However, most current self-supervised methods rely on contrastive learning, whose performance is sensitive to data augmentation, additional information, and the selection of positive and negative pairs. In this paper, we leverage the benefits of generative self-supervised learning for RS images and propose a Remote sensing foundation Model framework called RingMo, which consists of two parts. First, we construct a large-scale dataset of two million RS images collected from satellite and aerial platforms, covering multiple scenes and object types around the world. Second, we propose an RS foundation model training method designed for the dense and small objects found in complicated RS scenes. We show that the foundation model trained on our dataset with the RingMo method achieves state-of-the-art performance on eight datasets across four downstream tasks, demonstrating the effectiveness of the proposed framework. Through this in-depth exploration, we believe it is time for RS researchers to embrace generative self-supervised learning and leverage its general representation capabilities to speed up the development of RS applications.
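The generative self-supervised objective the abstract contrasts with contrastive learning is commonly instantiated as masked image modeling: random patches of an image are hidden and the network is trained to reconstruct them, with the loss computed only on the masked patches. The following is a minimal NumPy sketch of that idea under stated assumptions; the function names and the 0.6 mask ratio are illustrative placeholders, not the paper's actual implementation or masking strategy.

```python
import numpy as np

def mask_patches(num_patches, mask_ratio=0.6, rng=None):
    """Return a boolean mask selecting which patches to hide.

    mask_ratio is a hypothetical value; the real method's masking
    scheme and ratio are described in the paper itself.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def mim_loss(pred_patches, target_patches, mask):
    """L1 reconstruction loss averaged over masked patches only.

    pred_patches / target_patches: (num_patches, patch_dim) arrays.
    Unmasked (visible) patches contribute nothing to the loss.
    """
    return np.abs(pred_patches[mask] - target_patches[mask]).mean()

# Toy usage: 16 patches of dimension 12, half of them masked.
target = np.arange(16 * 12, dtype=float).reshape(16, 12)
mask = mask_patches(16, mask_ratio=0.5)
pred = target + 1.0  # a dummy "reconstruction" off by 1 everywhere
loss = mim_loss(pred, target, mask)
```

Because the loss ignores visible patches, the encoder cannot succeed by copying pixels; it must infer the hidden content from context, which is what yields a general-purpose representation without the positive/negative pair selection that contrastive methods depend on.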
