Abstract

Owing to their strong local representation learning ability, deep convolutional neural networks (CNNs) perform well in remote sensing scene classification. However, CNNs focus on location-sensitive representations in the spatial domain and are limited in mining contextual information. Meanwhile, remote sensing scene classification still faces challenges such as complex scenes and large variations in target size. Addressing these problems requires more robust feature representation learning networks. In this paper, a novel and explainable spatial-frequency multi-scale Transformer framework, SF-MSFormer, is proposed for remote sensing scene classification. It mainly comprises spatial-domain and frequency-domain multi-scale Transformer branches, which jointly capture global multi-scale representations in the spatial and frequency domains. In addition, a texture-enhanced encoder is designed for the frequency-domain branch to adaptively capture global texture features, and an adaptive feature aggregation module integrates the spatial-frequency multi-scale features for final recognition. Experimental results verify the effectiveness of SF-MSFormer and show improved convergence. It achieves state-of-the-art overall accuracies of 98.72%, 98.6%, 99.72%, and 94.83% on the AID, UCM, WHU-RS19, and NWPU-RESISC45 datasets, respectively. Feature visualizations further demonstrate the explainability of the texture-enhanced encoder. The code implementation of this article will be available at https://github.com/yutinyang/SF-MSFormer.
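To make the dual-branch structure described above concrete, the following is a minimal PyTorch sketch of the idea: a spatial Transformer branch, a frequency-domain branch whose encoder attends over FFT-transformed tokens, and a gated fusion of the two. All class names (`SFMSFormerSketch`, `TextureEnhancedEncoder`, `AdaptiveAggregation`), the FFT-based texture modeling, and the sigmoid-gated aggregation are illustrative assumptions, not the authors' implementation (see the linked repository for that); the multi-scale pyramid is omitted for brevity.

```python
import torch
import torch.nn as nn


class TextureEnhancedEncoder(nn.Module):
    """Hypothetical frequency-domain encoder: attention over FFT-transformed
    tokens, so the branch sees global texture (spectral) statistics."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, N, C) token sequence
        # Move tokens to the frequency domain; keeping the real part is an
        # illustrative simplification, not the paper's exact formulation.
        freq = torch.fft.fft(x, dim=1).real
        h = self.norm(freq)
        out, _ = self.attn(h, h, h)
        return out + x  # residual connection back to the spatial tokens


class AdaptiveAggregation(nn.Module):
    """Assumed fusion scheme: a learned soft gate between the two branches."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_spatial, f_freq):
        g = self.gate(torch.cat([f_spatial, f_freq], dim=-1))
        return g * f_spatial + (1 - g) * f_freq


class SFMSFormerSketch(nn.Module):
    """Single-scale sketch of the abstract's dual-branch architecture."""

    def __init__(self, num_classes=45, dim=256, patch_dim=3 * 16 * 16):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)  # naive 16x16 patch embedding
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.spatial_branch = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.freq_branch = TextureEnhancedEncoder(dim)
        self.fuse = AdaptiveAggregation(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):  # patches: (B, N, patch_dim)
        tokens = self.embed(patches)
        f_sp = self.spatial_branch(tokens).mean(dim=1)  # pool over tokens
        f_fr = self.freq_branch(tokens).mean(dim=1)
        return self.head(self.fuse(f_sp, f_fr))


# Usage: two images pre-split into 196 flattened 16x16 RGB patches.
model = SFMSFormerSketch()
x = torch.randn(2, 196, 3 * 16 * 16)
logits = model(x)  # shape (2, 45), one score per NWPU-RESISC45 class
```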
