EMTCAL: Efficient Multiscale Transformer and Cross-Level Attention Learning for Remote Sensing Scene Classification

Xu Tang, Xiangrong Zhang, Mingteng Li, Fang Liu, Licheng Jiao, Jingjing Ma

doi:10.1109/tgrs.2022.3194505

Abstract

In recent years, convolutional neural network (CNN)-based methods have been widely used for remote sensing (RS) scene classification and have achieved excellent results. However, CNNs are not good at exploring contextual information, which is essential for fully understanding RS scenes. The transformer, a newer model that is skilled at mining the latent contextual information in RS scenes, has attracted researchers’ attention as a way to address this problem. Nevertheless, since the contents of RS scenes are diverse in type and vary in scale, the performance of the original transformer in RS scene classification falls short of expectations. In addition, because of its self-attention mechanism, the time cost of the transformer is high, which hinders its practicality in the RS community. To overcome these limitations, we propose a new model named efficient multi-scale transformer and cross-level attention learning (EMTCAL) for RS scene classification. EMTCAL combines the advantages of the CNN and the transformer to fully mine the information within RS scenes. First, it uses a multi-layer feature extraction module (MFEM) to acquire global visual features and multi-level convolutional features from RS scenes. Second, a contextual information extraction module (CIEM) is proposed to capture rich contextual information from the multi-level features. In CIEM, taking the characteristics of RS scenes and the computational complexity into account, we propose an efficient multi-scale transformer (EMST). EMST can mine the abundant multi-scale knowledge hidden in RS scenes and model its inherent relations at a small time cost. Third, a cross-level attention module (CLAM) is developed to aggregate the multi-level features and explore their correlations. Finally, a class score fusion module (CSFM) is designed to integrate the contributions of the global and aggregated multi-level features to form discriminative scene representations. Extensive experiments are conducted on three public RS scene data sets. The positive results demonstrate that EMTCAL achieves superior classification performance and outperforms many state-of-the-art methods.
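
To make the cost argument concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how EMST reduces the cost of self-attention, so the sketch assumes one common approach: keys and values are average-pooled at several window sizes before attention, so each query is compared with a shorter multi-scale sequence. The class score fusion is likewise assumed to be a simple weighted sum. All function names, the window sizes, and the weight `alpha` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def avg_pool_tokens(tokens, window):
    """Average consecutive groups of `window` tokens (last group may be shorter)."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def multiscale_attention(tokens, windows=(2, 4)):
    """ASSUMPTION: keys/values are the tokens pooled at several window sizes.

    Returns the attended tokens and the number of query-key score pairs.
    """
    reduced = []
    for w in windows:
        reduced.extend(avg_pool_tokens(tokens, w))
    return attention(tokens, reduced, reduced), len(tokens) * len(reduced)


def fuse_scores(global_scores, multilevel_scores, alpha=0.5):
    """ASSUMPTION: class score fusion as a weighted sum of two score vectors."""
    return [alpha * g + (1 - alpha) * m
            for g, m in zip(global_scores, multilevel_scores)]


# Eight toy 2-D tokens standing in for flattened feature-map positions.
tokens = [[float(i), float(i % 3)] for i in range(8)]
out, pair_count = multiscale_attention(tokens)
full_pairs = len(tokens) ** 2  # score pairs in plain self-attention
fused = fuse_scores([0.2, 0.8], [0.6, 0.4])
```

With eight tokens and windows of 2 and 4, the pooled sequence has 4 + 2 = 6 entries, so the sketch scores 8 × 6 = 48 query-key pairs instead of the 8 × 8 = 64 that plain self-attention needs. The gap widens as the token count grows, because the pooled sequence is shorter than the input by roughly the window size at each scale.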

Full Text