Abstract
Learning powerful discriminative features is the key to remote sensing scene classification. Most existing approaches based on convolutional neural networks (CNNs) have achieved strong results. However, they mainly focus on global-based visual features while ignoring object-based location features, which are important for large-scale scene classification. Remote sensing images contain a large number of scene-related ground objects, and graph convolutional networks (GCNs) have the potential to capture the dependencies among those objects. This article introduces a novel two-stream architecture that combines global-based visual features and object-based location features so as to improve the feature representation capability. First, we extract appearance visual features from the whole scene image with a CNN. Second, we detect ground objects and construct a graph over them to learn spatial location features with a GCN. As a result, the network can jointly capture appearance visual information and spatial location information. To the best of the authors' knowledge, we are the first to investigate the dependencies among objects in the remote sensing scene classification task. Extensive experiments on two datasets show that our framework improves the discriminative ability of features and achieves competitive accuracy against other state-of-the-art approaches.
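The abstract describes the two-stream design only at a high level. The sketch below is one minimal, hypothetical PyTorch rendering of that idea, not the authors' released code: the class names (TwoStreamNet, SimpleGCNLayer), feature dimensions, and mean-pooling readout are all our assumptions. One CNN backbone yields the global visual vector, a small GCN runs over detected-object nodes, and the two vectors are concatenated for classification.

# Hypothetical sketch of the two-stream CNN + GCN idea; names, dimensions,
# and the readout are our assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleGCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        # a_hat: (B, N, N) normalized adjacency over detected-object nodes
        return torch.relu(a_hat @ self.linear(h))

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes, obj_feat_dim=5, hidden_dim=64):
        super().__init__()
        # Stream 1: CNN backbone for global-based visual features (512-d).
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        # Stream 2: GCN over object nodes (e.g., box geometry + class score).
        self.gcn1 = SimpleGCNLayer(obj_feat_dim, hidden_dim)
        self.gcn2 = SimpleGCNLayer(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(512 + hidden_dim, num_classes)

    def forward(self, image, obj_feats, a_hat):
        # image: (B, 3, H, W); obj_feats: (B, N, obj_feat_dim)
        visual = self.cnn(image).flatten(1)                # (B, 512)
        h = self.gcn2(self.gcn1(obj_feats, a_hat), a_hat)  # (B, N, hidden)
        location = h.mean(dim=1)                           # simple readout
        return self.classifier(torch.cat([visual, location], dim=1))

Mean pooling over node embeddings is one simple way to turn per-object features into a scene-level location descriptor; the paper may use a different readout or fusion scheme.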
Highlights
1) To further improve classification accuracy, we propose a novel two-stream architecture, which combines a convolutional neural network (CNN) and a graph convolutional network (GCN) to learn both the global-based visual features and the object-based location features (see the graph-construction sketch after this list).
3) By combining the CNN and GCN, our proposed method improves the discriminative ability of features and achieves competitive accuracy against other state-of-the-art approaches.
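Highlight 1) hinges on turning detected ground objects into a graph before the GCN can learn location features. As a rough illustration only, the construction below uses Gaussian, distance-based edge weights over box centers; this recipe is our assumption, since the summary does not specify how the paper builds its graph.

# Hypothetical graph construction from detected objects; the Gaussian
# edge weights are our assumption, not a step the paper specifies here.
import numpy as np

def build_adjacency(boxes, sigma=0.5):
    """boxes: (N, 4) array of normalized [x1, y1, x2, y2] detections.
    Returns a symmetrically normalized adjacency matrix for a GCN."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    # Nearby objects get stronger edges; the zero-distance diagonal
    # already provides self-loops (weight exp(0) = 1).
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    a = np.exp(-dist ** 2 / (2 * sigma ** 2))
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 A D^-1/2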
Summary
Remote sensing scene classification aims to automatically classify remote sensing images into specific categories based on semantic content; it has attracted great attention in recent years due to its wide range of applications, such as natural disaster monitoring, land cover analysis, and urban planning [1]–[4]. Among earlier approaches, the most commonly used is the bag of visual words (BoVW), which encodes the local features of image patches into visual dictionaries by k-means clustering [12], [13]. Although these mid-level features are highly efficient, they ignore the spatial distribution information of remote sensing scene images, leading to poor representation capability. Our proposed method takes advantage of appearance visual information as well as spatial location and distribution cues to make better predictions for remote sensing scene classification. To further improve performance, we propose a novel two-stream architecture that jointly learns both the global-based visual features and the object-based location features.
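For concreteness, here is a minimal BoVW sketch using scikit-learn; it is illustrative only, not the exact pipeline of [12], [13], and the random arrays stand in for real local descriptors (e.g., SIFT) extracted from image patches.

# Minimal BoVW sketch with scikit-learn; illustrative only, not the
# exact pipeline of the cited works.
import numpy as np
from sklearn.cluster import KMeans

def bovw_histogram(descriptors, kmeans):
    """Quantize one image's local descriptors against the learned visual
    dictionary and return a normalized word-frequency histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Dictionary learning: cluster descriptors pooled from all training images.
train_descriptors = np.random.rand(10000, 128)  # stand-in for real features
kmeans = KMeans(n_clusters=256, n_init=10).fit(train_descriptors)
image_feature = bovw_histogram(np.random.rand(300, 128), kmeans)

The histogram records only word frequencies, not where in the scene each patch lies; that discarded spatial information is precisely what our GCN stream reintroduces.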