Abstract

Compared with natural scenes, aerial scenes are usually composed of numerous objects densely distributed across the aerial view, so more key local semantic features are needed to describe them. However, when existing CNNs are used for remote sensing image classification, they typically focus on the global semantic features of the image, and in deep models in particular, shallow and intermediate features are easily lost. This article proposes a channel–spatial attention and depthwise separable convolution (CSDS) network for aerial scene classification to address these challenges. First, we construct a depthwise separable convolution (DS-Conv) and pyramid residual connection architecture: DS-Conv filters each channel separately and then merges the results, substantially reducing the required computation, while the pyramid residual connections link features from multiple layers and create associations among them. Then, a channel–spatial attention algorithm lets the model extract more effective features in both the channel and spatial domains. Finally, an improved cross-entropy loss function reduces the impact of similar categories on backpropagation. Comparative experiments on three public datasets show that the CSDS network achieves results comparable to those of other state-of-the-art methods. In addition, visualizations of the extracted features produced by the Grad-CAM algorithm, together with ablation experiments on each module, reflect the strong feature learning and representation capabilities of the proposed CSDS network.
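As a concrete illustration of the building blocks named in the abstract, the sketch below pairs a depthwise separable convolution (per-channel depthwise filtering followed by a 1x1 pointwise merge) with a CBAM-style channel–spatial attention module in PyTorch. This is a minimal sketch of the general techniques only, not the paper's exact CSDS modules; the class names, the reduction ratio of 16, and the 7x7 spatial kernel are our assumptions.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: per-channel (depthwise) filtering
    followed by a 1x1 (pointwise) convolution that merges the channels."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride,
                                   padding=k // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ChannelSpatialAttention(nn.Module):
    """Channel attention (shared MLP over pooled channel descriptors)
    followed by spatial attention (7x7 conv over pooled channel maps)."""
    def __init__(self, ch, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: weight channels by avg- and max-pooled statistics.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: weight locations via channel-wise avg/max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

# Usage: feats = ChannelSpatialAttention(64)(DSConv(3, 64)(torch.randn(2, 3, 224, 224)))
```

The abstract likewise describes the improved cross-entropy loss only at a high level. A common technique with the same stated goal, reducing the influence of similar categories on backpropagation, is label smoothing, shown here purely as a stand-in:

```python
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy with a label-smoothing term: softening the one-hot
    target damps over-confident gradients from confusable classes."""
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(1, target.unsqueeze(1)).squeeze(1)  # standard CE term
    uniform = -logp.mean(dim=-1)                           # smoothing term
    return ((1 - eps) * nll + eps * uniform).mean()
```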

Highlights

  • Remote sensing and earth observation, together called earth vision, are important branches and applications of computer vision and image understanding [1]–[3]

  • In remote sensing image classification tasks, the UC Merced (UCM) dataset contains 2100 labeled samples, only half of which may be used for training, and is characterized by an uneven sample distribution

  • The overall accuracy (OA), average accuracy (AA), Kappa coefficient (Kappa), F1 score (F1), and confusion matrix (CM) are used in the experiments to describe the performance of the proposed CSDS network (a computation sketch for these indices follows this list)

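A minimal sketch of how these five indices can be computed with scikit-learn, assuming integer-encoded true and predicted labels (the function name and the macro averaging for F1 are our assumptions):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, confusion_matrix, f1_score)

def evaluate(y_true, y_pred):
    """Return OA, AA, Kappa, F1, and CM for integer class labels."""
    return {
        "OA": accuracy_score(y_true, y_pred),             # overall accuracy
        "AA": balanced_accuracy_score(y_true, y_pred),    # mean per-class recall
        "Kappa": cohen_kappa_score(y_true, y_pred),       # chance-corrected agreement
        "F1": f1_score(y_true, y_pred, average="macro"),  # macro-averaged F1
        "CM": confusion_matrix(y_true, y_pred),
    }

print(evaluate(np.array([0, 1, 2, 2]), np.array([0, 1, 2, 1])))
```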

Summary

INTRODUCTION

Remote sensing and earth observation, together called earth vision, are important branches and applications of computer vision and image understanding [1]–[3]. 1) Useless Background Information: The key object in a sample usually determines the label of a remote sensing image. To highlight key objects and suppress redundant background information, local key features must be extracted to enhance the semantic representation of the aerial image. Moreover, the main direction angle of a key object in an aerial scene image can vary greatly (see Fig. 1(b)), and because of the large shooting height and angle of aerial scenes, the distribution of key objects differs from the central distribution observed in natural scene images (see Fig. 1). These characteristics increase the difficulty of understanding remote sensing images. Traditional CNNs tend to focus on global semantics, making it difficult to extract the key features of aerial scenes, which may weaken the representation of the scene and prevent accurate classification [11]

Motivation and Objectives
Aerial Scene Classification
Depthwise Separable Convolution
Attention Mechanisms
Feature Extraction Backbone
Dataset Description
Experimental Details
Accuracy Evaluation Indices
Experimental Results
METHODS
CSDS ablation experiment
Findings
Attention Maps on CSDS
CONCLUSION

