Deep learning has revolutionized remote sensing image processing over the past few years. Nevertheless, annotating high-quality samples is difficult and time-consuming, which limits the performance of deep neural networks because of insufficient supervision. To address this contradiction, we investigate a multimodal self-supervised learning (MultiSSL) paradigm for the pre-training and classification of remote sensing images. Specifically, the proposed self-supervised feature learning model adopts an asymmetric encoder–decoder structure, in which a deep unified encoder learns high-level key information characterizing multimodal remote sensing data and task-specific lightweight decoders reconstruct the original data. To further enhance the feature extraction capability, cross-attention layers exchange the information contained in the heterogeneous features, so that more complementary information is learned from the multimodal remote sensing data. In the fine-tuning stage, the pre-trained encoder and cross-attention layers serve as the feature extractor, and the learned features are combined with the corresponding spectral information for land cover classification through a lightweight classifier. The self-supervised pre-training model can learn high-level key features from unlabeled samples, thereby exploiting the feature extraction capability of deep neural networks while reducing their dependence on annotated samples. Compared with existing classification paradigms, the proposed multimodal self-supervised pre-training and fine-tuning scheme achieves superior performance for remote sensing land cover classification.
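To make the pre-training scheme concrete, the sketch below illustrates one possible reading of the described architecture: a shared (unified) encoder applied to two modalities, cross-attention layers that exchange information between them, and lightweight per-modality decoders trained with a reconstruction loss. All specifics here are assumptions for illustration only; the abstract does not state the modalities, dimensions, encoder depth, or loss, and the actual model in the paper may differ substantially.

```python
import torch
import torch.nn as nn

class MultiSSLPretrainSketch(nn.Module):
    """Illustrative asymmetric encoder-decoder with cross-attention.

    Hypothetical setup: each modality is a flat per-pixel/patch vector;
    the shared encoder is deliberately deeper than the decoders.
    """

    def __init__(self, dim_a=144, dim_b=64, d_model=128, n_heads=4):
        super().__init__()
        # Per-modality projections into a shared embedding space (assumed)
        self.embed_a = nn.Linear(dim_a, d_model)
        self.embed_b = nn.Linear(dim_b, d_model)
        # Deep unified encoder shared by both modalities
        self.encoder = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        # Cross-attention: each modality queries the other's features
        self.cross_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Task-specific lightweight decoders reconstruct the raw inputs
        self.decoder_a = nn.Linear(d_model, dim_a)
        self.decoder_b = nn.Linear(d_model, dim_b)

    def forward(self, x_a, x_b):
        # Encode each modality with the shared encoder (single-token sequences)
        z_a = self.encoder(self.embed_a(x_a)).unsqueeze(1)
        z_b = self.encoder(self.embed_b(x_b)).unsqueeze(1)
        # Exchange complementary information via cross-attention
        f_a, _ = self.cross_a(z_a, z_b, z_b)  # modality A attends to B
        f_b, _ = self.cross_b(z_b, z_a, z_a)  # modality B attends to A
        # Lightweight decoders reconstruct the original data
        return self.decoder_a(f_a.squeeze(1)), self.decoder_b(f_b.squeeze(1))

# Self-supervised pre-training objective (assumed MSE reconstruction)
model = MultiSSLPretrainSketch()
x_a, x_b = torch.randn(8, 144), torch.randn(8, 64)
rec_a, rec_b = model(x_a, x_b)
loss = nn.functional.mse_loss(rec_a, x_a) + nn.functional.mse_loss(rec_b, x_b)
```

In the fine-tuning stage described above, the decoders would be discarded and the pre-trained encoder plus cross-attention layers reused as a frozen or lightly tuned feature extractor, with their outputs concatenated with the corresponding spectral information and passed to a lightweight classifier; that step is likewise only sketched here from the abstract, not taken from the paper's implementation.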