Abstract

Extracting buildings automatically from high-resolution aerial images is a significant and fundamental task for various practical applications, such as land-use statistics and urban planning. Recently, various methods based on deep learning, especially fully convolutional networks, have achieved impressive scores on this challenging semantic segmentation task. However, the lack of global contextual information and coarse upsampling methods limit further improvement in building extraction performance. To address these problems simultaneously, we propose a novel network named Efficient Non-local Residual U-shape Network (ENRU-Net), which is composed of a well-designed U-shape encoder-decoder structure and an improved non-local block named the asymmetric pyramid non-local block (APNB). The encoder-decoder structure is adopted to extract and restore the feature maps carefully, while APNB captures global contextual information through a self-attention mechanism. We evaluate the proposed ENRU-Net and compare it with other state-of-the-art models on two widely used public aerial building imagery datasets: the Massachusetts Buildings Dataset and the WHU Aerial Imagery Dataset. The experiments show that the accuracy of ENRU-Net on these datasets improves remarkably over previous state-of-the-art semantic segmentation models, including FCN-8s, U-Net, SegNet and Deeplab v3. The subsequent analysis also indicates that ENRU-Net has advantages in efficiency for building extraction from high-resolution aerial images.
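The U-shape encoder-decoder with skip-connections described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: pooling and nearest-neighbour upsampling stand in for the learned encoder and decoder stages, the function names are hypothetical, and a real network would apply convolutions after each channel concatenation to fuse and reduce the features.

```python
import numpy as np

def downsample(x):
    # 2x2 max pooling as a stand-in for an encoder stage
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    # nearest-neighbour upsampling as a stand-in for a decoder stage
    return x.repeat(2, axis=1).repeat(2, axis=2)

def u_shape_forward(x, depth=2):
    """Minimal U-shape pass: encoders shrink the feature map, decoders
    expand it back, and skip-connections concatenate each shallow encoder
    output with the corresponding decoder input along the channel axis."""
    skips = []
    for _ in range(depth):
        skips.append(x)                            # keep shallow, detailed features
        x = downsample(x)
    for skip in reversed(skips):
        x = upsample(x)
        x = np.concatenate([skip, x], axis=0)      # skip-connection (channel concat)
    return x
```

For a `(3, 8, 8)` input with `depth=2`, each decoder stage concatenates the saved encoder features, so the output grows to 9 channels at the original 8×8 resolution; in the real network, convolutions would restore a fixed channel width.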

Highlights

  • Automatic building extraction from high-resolution aerial imagery is a fundamental task for various applications, such as urban planning, economic statistics and disaster monitoring.

  • We evaluate our ENRU-Net on two public aerial building imagery datasets: the Massachusetts Buildings Dataset [24] and the WHU Aerial Imagery Dataset [9], and evaluate several state-of-the-art models for comparison.

  • All models achieve higher scores on every metric on this dataset; even the simple fully convolutional network FCN-8s reaches 98.3%, 85.86% and 92.39% in Overall Accuracy (OA), Intersection over Union (IoU) and F1, respectively, which benefits from the lower image complexity, higher labeling precision and higher spatial resolution.
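The metrics quoted above can be computed from a binary confusion matrix. A minimal numpy sketch (the function name is ours, not from the paper) treating buildings as the positive class:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Overall Accuracy, IoU and F1 for the building (positive) class,
    computed from binary prediction and ground-truth masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)      # building pixels correctly predicted
    fp = np.sum(pred & ~gt)     # background predicted as building
    fn = np.sum(~pred & gt)     # building predicted as background
    tn = np.sum(~pred & ~gt)    # background correctly predicted
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, iou, f1
```

Note that OA counts the (usually dominant) background class, which is why it sits far above IoU and F1 on aerial imagery.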


Summary

INTRODUCTION

Automatic building extraction from high-resolution aerial imagery is a fundamental task for various applications, such as urban planning, economic statistics and disaster monitoring. Encoder-decoder networks adopt several cascaded upsampling layers after the feature extractor to expand the feature maps carefully. They deliver the shallow feature maps, which contain detailed information, to the deep layers through skip-connections, which improves the ability to recognize small buildings in high-resolution aerial imagery. To fully utilize the detailed information in the output feature maps of each encoder along the downsampling path when reconstructing the feature maps, we adopt skip-connections to deliver the shallow feature maps to the corresponding decoders in the upsampling path. This symmetrically designed structure significantly improves the ability to recognize the boundaries of small buildings when extracting buildings from high-resolution aerial images. The subsequent experiments confirm the impressive impact of global contextual information on fully convolutional networks for building extraction.
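The global-context mechanism mentioned above, the asymmetric pyramid non-local block, can be sketched in numpy. This is a simplified illustration, not the paper's implementation: real APNB projects queries, keys and values with 1×1 convolutions and learns an output transform, while here raw features are used directly and the pyramid pooling sizes are illustrative. The key idea survives the simplification: queries come from every spatial position, but keys/values are sampled from pooled pyramid levels, so attention costs O(N·S) instead of O(N²).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_to(x, size):
    # (C, H, W) -> (C, size, size) by block averaging; assumes H, W divisible
    c, h, w = x.shape
    return x.reshape(c, size, h // size, size, w // size).mean(axis=(2, 4))

def asymmetric_nonlocal(feat, pool_sizes=(1, 2, 4)):
    """Sketch of an asymmetric (pyramid-sampled) non-local block:
    every position attends to a small set of pooled anchor points
    rather than to all N positions."""
    c, h, w = feat.shape
    n = h * w
    query = feat.reshape(c, n).T                     # (N, C): one query per pixel
    kv = np.concatenate(                             # (C, S), S = sum of s*s anchors
        [avg_pool_to(feat, s).reshape(c, -1) for s in pool_sizes], axis=1
    )
    attn = softmax(query @ kv, axis=-1)              # (N, S) attention weights
    out = (attn @ kv.T).T.reshape(c, h, w)           # aggregate pooled values
    return feat + out                                # residual connection
```

With `pool_sizes=(1, 2, 4)` there are only S = 1 + 4 + 16 = 21 key/value anchors regardless of image size, which is the source of the efficiency claim.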

