Abstract

Remote sensing scene classification remains challenging due to the complexity and variety of scenes. With the development of attention-based methods, Convolutional Neural Networks (CNNs) have achieved competitive performance in remote sensing scene classification tasks. As an important attention-based model, the Transformer has achieved great success in natural language processing and has recently been applied to computer vision tasks. However, most existing methods divide the original image into multiple patches and encode the patches as the input of the Transformer, which limits the model’s ability to learn the overall features of the image. In this paper, we propose a new remote sensing scene classification method, Remote Sensing Transformer (TRS), a powerful “pure CNNs → Convolution + Transformer → pure Transformers” structure. First, we integrate self-attention into ResNet in a novel way, using our proposed Multi-Head Self-Attention layer instead of the 3 × 3 spatial convolutions in the bottleneck. Then we connect multiple pure Transformer encoders to further improve representation learning, relying entirely on attention. Finally, we use a linear classifier for classification. We train our model on four public remote sensing scene datasets: UC-Merced, AID, NWPU-RESISC45, and OPTIMAL-31. The experimental results show that TRS exceeds the state-of-the-art methods and achieves higher accuracy.
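The core architectural move described in the abstract — replacing the 3 × 3 spatial convolution inside a ResNet bottleneck with multi-head self-attention over spatial positions — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the channel sizes, the number of heads, and the use of a learned flattened positional embedding are all assumptions.

```python
import torch
import torch.nn as nn

class MHSABottleneck(nn.Module):
    """Sketch of a ResNet-style bottleneck whose 3x3 spatial convolution
    is replaced by multi-head self-attention over the feature map's
    spatial positions. Hyperparameters are illustrative, not the paper's."""

    def __init__(self, in_ch, mid_ch, heads=4, hw=14):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1, bias=False)  # 1x1 channel reduction
        self.bn1 = nn.BatchNorm2d(mid_ch)
        # Learned positional embedding for the flattened H*W grid (an assumption).
        self.pos = nn.Parameter(torch.zeros(hw * hw, mid_ch))
        self.attn = nn.MultiheadAttention(mid_ch, heads, batch_first=True)
        self.expand = nn.Conv2d(mid_ch, in_ch, 1, bias=False)  # 1x1 channel expansion
        self.bn2 = nn.BatchNorm2d(in_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        b, _, h, w = x.shape
        y = self.relu(self.bn1(self.reduce(x)))
        # Flatten spatial positions into tokens: (B, C, H, W) -> (B, H*W, C).
        t = y.flatten(2).transpose(1, 2) + self.pos[: h * w]
        t, _ = self.attn(t, t, t)          # self-attention in place of the 3x3 conv
        y = t.transpose(1, 2).reshape(b, -1, h, w)
        return self.relu(x + self.bn2(self.expand(y)))  # residual connection
```

Because attention is computed over all H × W positions, every output location can aggregate information from the entire feature map, unlike a 3 × 3 convolution whose receptive field is local.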

Accepted: 13 October 2021

Highlights

  • We demonstrate the connection between the MHSA-Bottleneck and the Transformer, and show that the MHSA-Bottleneck can be regarded as a 3D Transformer.

  • The main purpose of this paper is to demonstrate that optimizing Convolutional Neural Networks (CNNs) with Transformers can improve network performance.


Introduction

With the rapid development of remote sensing technology and the emergence of more sophisticated remote sensing sensors, remote sensing technologies have been widely used in various fields [1,2,3,4]. As one of the core tasks of remote sensing, scene classification is often used as a benchmark to measure the understanding of remote sensing scene images. Traditional remote sensing scene classification methods mainly rely on the spatial features of images [5,6]. With the development of deep learning, many deep convolutional neural network models have made significant progress in remote sensing scene classification. Neural networks based on convolution operations need to stack multiple layers [9]. He et al. [10] proposed ResNet, which uses residual connections to make Convolutional Neural Networks deeper and easier to train.
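The residual idea of He et al. [10] referenced above can be sketched in a few lines: the block learns a residual function F(x) and outputs F(x) + x, so gradients flow through the identity path even in very deep stacks. The channel count and kernel sizes here are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal residual block in the style of ResNet:
    output = F(x) + x, where F is two 3x3 convolutions."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + x)  # identity shortcut: output = F(x) + x
```

The identity shortcut is what lets such blocks be stacked far deeper than plain convolutional stacks without the degradation problem ResNet was designed to address.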

