Abstract

Remote sensing image scene classification is challenging due to the complicated spatial arrangement and varied object sizes inside a large-scale aerial image. Among the bottlenecks that limit current deep learning methods in depicting and discriminating the complexity of remote sensing scenes, weak local semantic representation and weak multi-scale feature representation stand out, so strengthening both is necessary. In this paper, we propose a multi-scale stacking attention pooling (MS2AP) to tackle these challenges, which makes three main contributions. Firstly, it can be conveniently embedded into current CNN models in an end-to-end manner to enhance the feature representation capability for remote sensing scenes. Secondly, we propose a novel residual channel-spatial attention module to mine the key local semantics in the feature maps. Compared with current attention modules, it can fuse top-down discriminative features and bottom-up convolutional features in both the channel and spatial domains. Thirdly, we propose a multi-scale dilated convolutional operator that extracts multi-scale feature maps while keeping their spatial sizes identical. In our MS2AP, these multi-scale feature maps are first stacked and then down-sampled by a weighted pooling whose weight matrix comes from our attention module. Extensive experiments demonstrate that our MS2AP outperforms the baseline by 4.24% on the UCM, 7.22% on the AID, and 14.12% on the NWPU benchmarks, respectively, and outperforms current state-of-the-art methods by a large margin.
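To make the pipeline concrete, the following is a minimal PyTorch sketch of the flow the abstract describes: parallel dilated convolutions with matched padding so all branches keep the same spatial size, channel-wise stacking, and attention-weighted pooling. The dilation rates, the single 1x1-convolution attention head, and all layer shapes are illustrative assumptions; the paper's residual channel-spatial attention module is more elaborate than this stand-in.

```python
import torch
import torch.nn as nn


class MS2APSketch(nn.Module):
    """Illustrative sketch of multi-scale stacking attention pooling.

    Hyperparameters (dilation rates, attention head) are assumptions
    for this sketch, not the paper's exact configuration.
    """

    def __init__(self, in_channels, dilations=(1, 2, 3)):
        super().__init__()
        # Dilated 3x3 convolutions; padding equals dilation, so every
        # branch returns a feature map of the same spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, in_channels, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        )
        # Placeholder attention: one weight per spatial position,
        # standing in for the residual channel-spatial attention module.
        self.attention = nn.Sequential(
            nn.Conv2d(in_channels * len(dilations), 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Extract same-size multi-scale feature maps and stack them
        # along the channel dimension.
        stacked = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Weighted pooling: attention weights modulate the stacked
        # features before spatial aggregation to a feature vector.
        w = self.attention(stacked)
        return (stacked * w).sum(dim=(2, 3)) / w.sum(dim=(2, 3)).clamp_min(1e-6)
```

Because the module reduces any input feature map to a fixed-length vector, it can replace the global pooling layer of a standard CNN backbone, which is consistent with the claim that MS2AP embeds into current models end-to-end.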
