Abstract

Semantic-level land-use scene classification is a challenging problem, in which deep learning methods, e.g., convolutional neural networks (CNNs), have shown remarkable capacity. However, a lack of sufficient labeled images has proved a hindrance to increasing the land-use scene classification accuracy of CNNs. To address this problem, this paper proposes a CNN pre-training method guided by a human visual attention mechanism. Specifically, a computational visual attention model is used to automatically extract salient regions from unlabeled images. Sparse filters are then adopted to learn features from these salient regions, and the learned parameters are used to initialize the convolutional layers of the CNN. Finally, the CNN is fine-tuned on labeled images. Experiments on the UCMerced and AID datasets show that, when combined with a demonstrative CNN, our method achieves 2.24% higher accuracy than a plain CNN, and obtains an overall accuracy of 92.43% when combined with AlexNet. The results indicate that the proposed method can effectively improve CNN performance using easy-to-access unlabeled images, and thus can enhance land-use scene classification especially when a large-scale labeled dataset is unavailable.
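The pre-training step above hinges on sparse filtering over patches drawn from salient regions. The following is a minimal sketch of the sparse-filtering objective and its optimization, not the paper's exact implementation; the patch size (5×5), filter count, and the use of L-BFGS with finite-difference gradients are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_objective(w_flat, X, n_filters):
    """Sparse filtering cost: l1 norm of doubly-normalized filter responses.

    X is an (n_features, n_samples) matrix of vectorized patches, assumed
    to come from salient regions selected by the visual attention model.
    """
    W = w_flat.reshape(n_filters, X.shape[0])
    F = W @ X                         # linear filter responses
    Fs = np.sqrt(F ** 2 + 1e-8)      # soft absolute value (smooth, nonneg.)
    # Normalize each filter's responses across samples, then each sample
    # across filters, before summing the l1 sparsity penalty.
    Nf = Fs / np.linalg.norm(Fs, axis=1, keepdims=True)
    Nhat = Nf / np.linalg.norm(Nf, axis=0, keepdims=True)
    return np.abs(Nhat).sum()

rng = np.random.default_rng(0)
patches = rng.standard_normal((25, 200))   # stand-in for 5x5 salient patches
n_filters = 8
w0 = 0.01 * rng.standard_normal(n_filters * 25)

res = minimize(sparse_filtering_objective, w0, args=(patches, n_filters),
               method="L-BFGS-B", options={"maxiter": 50})

# Reshape the learned filters into 5x5 kernels that could initialize
# a convolutional layer before supervised fine-tuning.
kernels = res.x.reshape(n_filters, 5, 5)
```

In practice the patches would be sampled from the salient regions of unlabeled imagery rather than generated randomly, and an analytic gradient would replace the finite-difference one for efficiency.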

Highlights

  • Land-use scene classification plays an important role in remote-sensing applications, such as land-use mapping

  • This paper proposes a framework that improves land-use scene classification through unsupervised feature learning on a large-scale unlabeled satellite-image dataset

  • To avoid extracting features from meaningless areas, and to reduce unnecessary computation, our framework incorporates a computational visual attention model inspired by the human visual processing mechanism


Introduction

Land-use scene classification plays an important role in remote-sensing applications, such as land-use mapping. As the spatial resolution of remote-sensing images becomes finer, land-use scene classification draws unprecedented attention, because neither pixel-level nor object-based image analysis (OBIA) can support remote-sensing image understanding at the semantic level. As a core problem in image-related applications, image feature representation has shifted from handcrafted to learning-based methods. Most of the early literature is based on handcrafted features, such as bag-of-visual-words (BoVW) [1,2], part-based models [3], and models fusing global and local descriptors [4,5]. Owing to the growing demand for regional land-use mapping using multi-source and multi-temporal remote-sensing images, handcrafted features are limited in their ability to provide robust and transferable feature representations for scene classification.

