Abstract

Automated methods to extract buildings from very high resolution (VHR) remote sensing data have applications in a wide range of fields. Many convolutional neural network (CNN) based methods have been proposed and have achieved significant advances in the building extraction task. In order to refine predictions, many recent approaches fuse features from earlier layers of CNNs to introduce abundant spatial information, a strategy known as the skip connection. However, reusing earlier features directly, without processing, can reduce the performance of the network. To address this problem, we propose a novel fully convolutional network (FCN) that adopts attention-based re-weighting to extract buildings from aerial imagery. Specifically, we consider the semantic gap between features from different stages and leverage the attention mechanism to bridge the gap prior to the fusion of features. The inferred attention weights along the spatial and channel-wise dimensions make the low-level feature maps adaptive to the high-level feature maps in a target-oriented manner. Experimental results on three publicly available aerial imagery datasets show that the proposed model (RFA-UNet) achieves comparable or improved performance relative to other state-of-the-art models for building extraction.
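The re-weighting idea in the abstract can be sketched in a few lines. The following is a minimal numpy illustration of the general mechanism, not the authors' exact RFA-UNet module: function names and the simple pooled-sigmoid weights are illustrative assumptions, and the learned convolutions of a real attention block are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(high, low):
    # Global-average-pool the high-level features to one weight per
    # channel, squash to (0, 1), and re-weight the low-level map.
    w = sigmoid(high.mean(axis=(1, 2)))        # shape (C,)
    return low * w[:, None, None]

def spatial_attention(high, low):
    # Collapse the high-level channels to a single (H, W) saliency map
    # and use it to gate every channel of the low-level features.
    s = sigmoid(high.mean(axis=0))             # shape (H, W)
    return low * s[None, :, :]

def reweighted_fusion(high, low):
    # Bridge the semantic gap: re-weight the low-level features along
    # the channel and spatial dimensions, then fuse via concatenation
    # as a plain skip connection would.
    low = channel_attention(high, low)
    low = spatial_attention(high, low)
    return np.concatenate([high, low], axis=0)

# Toy feature maps: 8 channels over a 16x16 spatial grid, assuming the
# high-level map has already been upsampled to the low-level resolution.
rng = np.random.default_rng(0)
high = rng.standard_normal((8, 16, 16))
low = rng.standard_normal((8, 16, 16))
fused = reweighted_fusion(high, low)
print(fused.shape)  # (16, 16, 16)
```

The key point is that the low-level map is modulated by weights inferred from the high-level map before the two are concatenated, rather than being concatenated raw.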

Highlights

  • Automatic extraction of buildings from remote sensing imagery is of paramount importance in many application areas such as urban planning, population estimation, and disaster response [1]

  • We evaluated the effect of the proposed joint attention module in UNet for building extraction in very high resolution (VHR) images

  • Applying the attention mechanism to the UNet segmentation model, we observe that our joint attention module improves the performance of the existing architecture for the task of building extraction in VHR images


Summary

Introduction

Automatic extraction of buildings from remote sensing imagery is of paramount importance in many application areas such as urban planning, population estimation, and disaster response [1]. Assigning a semantic building class label to each pixel in very high resolution (VHR) imagery of urban areas is challenging because of high intra-class and low inter-class variability [2,3]: in high-resolution images the building category comprises man-made objects of many different sizes, while urban scenes contain growing amounts of clutter, such as the shadows of tall buildings and rooftops that resemble roads. Patch-based CNN methods [9,10,11,12,13] were initially adopted for prediction in dense urban areas; these patch-based CNNs label the center pixel by processing an image patch through a neural network. Although FCN-based methods can produce dense pixel-wise output directly, the pixel-wise classification derived from the final score map is quite coarse because of the sequential sub-sampling operations in the FCN.
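The patch-based labeling scheme described above can be sketched concretely. This is a hedged toy illustration, not any of the cited methods: the sliding-window loop and the stand-in classifier (a threshold on the patch mean, where a real method would use a trained CNN) are assumptions for demonstration only.

```python
import numpy as np

def dense_labels_from_patches(image, classify, patch=5):
    # Patch-based labeling: slide a window over the image and let the
    # classifier assign a class to each window's center pixel.
    r = patch // 2
    h, w = image.shape
    padded = np.pad(image, r, mode="reflect")  # handle image borders
    out = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            out[i, j] = classify(padded[i:i + patch, j:j + patch])
    return out

# Toy single-band image and a stand-in "classifier": call the center
# pixel a building (1) if the surrounding patch mean is positive.
rng = np.random.default_rng(1)
image = rng.standard_normal((8, 8))
labels = dense_labels_from_patches(image, lambda p: int(p.mean() > 0))
print(labels.shape)  # (8, 8)
```

Running one forward pass per pixel in this fashion is what makes patch-based prediction expensive, which motivates the dense, single-pass output of FCN-based methods mentioned above.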

