Abstract

With the development of high-resolution satellites, remote sensing (RS) scene classification has attracted increasing attention. Convolutional neural networks (CNNs), which replace traditional handcrafted features with a learning-based feature extraction mechanism, are widely used in scene classification. However, CNNs are less effective at capturing long-range contextual relations, which limits further improvement. The visual transformer (VT), an emerging image processing architecture, offers a new perspective for RS scene classification by directly capturing long-range features. Although a few works have combined CNNs and VTs through simple concatenation, the collaboration between the two remains insufficient. To address these issues, we propose a local and long-range collaborative framework (L2RCF). First, we design a dual-stream structure to extract local and long-range features. Second, a cross-feature calibration (CFC) module is designed to improve the representation of the fused features. Third, a novel joint loss combining deep supervision (DS) and deep mutual learning (DML) is proposed to enhance the dual-stream feature extractor and further improve the fused features. Finally, a two-stage semi-supervised training strategy is designed to improve performance by exploiting unlabeled samples. To demonstrate the effectiveness of L2RCF, we conducted experiments on three widely used RS scene classification data sets: RSSCN7, AID, and NWPU. The results show that L2RCF significantly outperforms several state-of-the-art scene classification methods.
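
The abstract does not give implementation details, but the dual-stream design with cross-feature calibration can be outlined roughly as follows. This is a minimal PyTorch sketch under our own assumptions: the ResNet-18 and ViT-B/16 backbones, the feature dimension, and the gating-based calibration are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CrossFeatureCalibration(nn.Module):
    """Hypothetical CFC module: each stream is re-weighted by a gate computed
    from the other stream before fusion (illustrative, not the paper's exact design)."""

    def __init__(self, dim):
        super().__init__()
        self.gate_from_long = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_from_local = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, f_local, f_long):
        f_local_cal = f_local * self.gate_from_long(f_long)   # calibrate local with long-range cues
        f_long_cal = f_long * self.gate_from_local(f_local)   # calibrate long-range with local cues
        return torch.cat([f_local_cal, f_long_cal], dim=1)


class DualStreamClassifier(nn.Module):
    """Local (CNN) + long-range (transformer) dual-stream classifier with CFC fusion."""

    def __init__(self, num_classes, dim=512):
        super().__init__()
        self.local_stream = models.resnet18(weights=None)        # local-feature stream (assumed backbone)
        self.local_stream.fc = nn.Linear(self.local_stream.fc.in_features, dim)
        self.long_stream = models.vit_b_16(weights=None)         # long-range stream (assumed backbone)
        self.long_stream.heads = nn.Linear(self.long_stream.hidden_dim, dim)
        self.cfc = CrossFeatureCalibration(dim)
        self.head = nn.Linear(2 * dim, num_classes)              # classifier on the fused features
        self.head_local = nn.Linear(dim, num_classes)            # auxiliary head for DS/DML
        self.head_long = nn.Linear(dim, num_classes)             # auxiliary head for DS/DML

    def forward(self, x):
        f_local = self.local_stream(x)
        f_long = self.long_stream(x)
        fused = self.cfc(f_local, f_long)
        return self.head(fused), self.head_local(f_local), self.head_long(f_long)
```

With 224x224 RGB inputs, the model returns fused, local-only, and long-range-only logits. Similarly, the joint loss combining deep supervision and deep mutual learning is not specified beyond the abstract; a plausible formulation, again under our own assumptions (the loss weights and the symmetric KL term are hypothetical), would be:

```python
import torch.nn.functional as F


def joint_loss(logits_fused, logits_local, logits_long, targets,
               lambda_ds=0.5, lambda_dml=0.1):
    """Sketch of a DS + DML joint loss; the weights and exact terms are assumptions."""
    # Main supervision on the fused prediction.
    loss_main = F.cross_entropy(logits_fused, targets)
    # Deep supervision: auxiliary cross-entropy on each stream's own prediction.
    loss_ds = F.cross_entropy(logits_local, targets) + F.cross_entropy(logits_long, targets)
    # Deep mutual learning: symmetric KL divergence so the two streams mimic each other.
    log_p_local = F.log_softmax(logits_local, dim=1)
    log_p_long = F.log_softmax(logits_long, dim=1)
    loss_dml = (F.kl_div(log_p_local, log_p_long.exp(), reduction="batchmean")
                + F.kl_div(log_p_long, log_p_local.exp(), reduction="batchmean"))
    return loss_main + lambda_ds * loss_ds + lambda_dml * loss_dml
```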
