Abstract
Extracting buildings automatically from high-resolution aerial images is a significant and fundamental task for various practical applications, such as land-use statistics and urban planning. Recently, various methods based on deep learning, especially fully convolutional networks, have achieved impressive scores on this challenging semantic segmentation task. However, the lack of global contextual information and naive upsampling methods limit further improvement in building extraction performance. To address these problems simultaneously, we propose a novel network named Efficient Non-local Residual U-shape Network (ENRU-Net), which is composed of a well-designed U-shape encoder-decoder structure and an improved non-local block named the asymmetric pyramid non-local block (APNB). The encoder-decoder structure is adopted to extract and carefully restore feature maps, while APNB captures global contextual information through a self-attention mechanism. We evaluate the proposed ENRU-Net and compare it with other state-of-the-art models on two widely used public aerial building imagery datasets: the Massachusetts Buildings Dataset and the WHU Aerial Imagery Dataset. The experiments show that ENRU-Net achieves remarkable accuracy improvements on these datasets over previous state-of-the-art semantic segmentation models, including FCN-8s, U-Net, SegNet and DeepLab v3. Subsequent analysis also indicates that ENRU-Net has efficiency advantages for building extraction from high-resolution aerial images.
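To give an intuition for how a non-local block with pyramid-sampled keys and values captures global context, the following is a minimal, hypothetical PyTorch sketch in the spirit of APNB. The class name, channel sizes and pooling scales are illustrative assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricNonLocalBlock(nn.Module):
    """Sketch of an asymmetric non-local (self-attention) block: the query attends
    over a small set of pyramid-pooled anchor positions instead of all pixels,
    reducing attention cost from O(N^2) to O(N*S) with S << N."""
    def __init__(self, in_channels, key_channels=64, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.query = nn.Conv2d(in_channels, key_channels, 1)
        self.key = nn.Conv2d(in_channels, key_channels, 1)
        self.value = nn.Conv2d(in_channels, key_channels, 1)
        self.out = nn.Conv2d(key_channels, in_channels, 1)
        self.pool_sizes = pool_sizes

    def _pyramid_sample(self, x):
        # Pool the feature map to several small grids and flatten them into S anchors.
        n, c, _, _ = x.shape
        samples = [F.adaptive_avg_pool2d(x, s).view(n, c, -1) for s in self.pool_sizes]
        return torch.cat(samples, dim=2)                           # (N, C', S)

    def forward(self, x):
        n, _, h, w = x.shape
        q = self.query(x).view(n, -1, h * w).permute(0, 2, 1)      # (N, HW, C')
        k = self._pyramid_sample(self.key(x))                      # (N, C', S)
        v = self._pyramid_sample(self.value(x)).permute(0, 2, 1)   # (N, S, C')
        attn = torch.softmax(torch.bmm(q, k), dim=-1)              # (N, HW, S)
        ctx = torch.bmm(attn, v).permute(0, 2, 1).view(n, -1, h, w)
        return x + self.out(ctx)                                   # residual connection
```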
Highlights
Automatic building extraction from high-resolution aerial imagery is a fundamental task for various applications, such as urban planning, economic statistics, and disaster monitoring.
We evaluate our ENRU-Net on two public aerial building imagery datasets, the Massachusetts Buildings Dataset [24] and the WHU Aerial Imagery Dataset [9], and evaluate several state-of-the-art models for comparison.
It can be observed that all models achieve higher scores on all metrics; even the simple fully convolutional network FCN-8s reaches 98.3%, 85.86%, and 92.39% in Overall Accuracy (OA), Intersection over Union (IoU), and F1 respectively, benefiting from the lower image complexity, higher labeling precision, and higher spatial resolution.
Summary
Automatic building extraction from high-resolution aerial imagery is a fundamental task for various applications, such as urban planning, economic statistics, and disaster monitoring. Encoder-decoder networks adopt several cascaded upsampling layers after the feature extractor to carefully expand the feature maps. They deliver shallow feature maps that contain detailed information to the deep layers through skip-connections, which increases the ability to recognize small buildings in high-resolution aerial imagery. To fully exploit the detailed information in the output feature maps of each encoder in the downsampling path when reconstructing the feature maps, we adopt skip-connections to deliver the shallow feature maps to the corresponding decoders in the upsampling path, as sketched below. This symmetrically designed structure significantly improves the ability to recognize the boundaries of small buildings when extracting buildings from high-resolution aerial images. The subsequent experiments confirm the impressive impact of global contextual information on fully convolutional networks for building extraction.
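The sketch below illustrates the skip-connection idea described above with a minimal, hypothetical U-shaped encoder-decoder in PyTorch; the depth, channel widths and layer choices are assumptions for exposition and do not reproduce the ENRU-Net configuration.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Two 3x3 convolutions with BatchNorm/ReLU, a common U-Net-style building block.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Minimal U-shaped encoder-decoder with skip-connections (illustrative only)."""
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = conv_block(in_ch, 64), conv_block(64, 128), conv_block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)           # 128 upsampled + 128 skipped channels
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)            # 64 upsampled + 64 skipped channels
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                           # shallow features, full resolution
        e2 = self.enc2(self.pool(e1))               # 1/2 resolution
        e3 = self.enc3(self.pool(e2))               # 1/4 resolution (bottleneck)
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # skip-connection from enc2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip-connection from enc1
        return self.head(d1)                        # per-pixel building/background logits
```

Concatenating the shallow encoder outputs with the upsampled decoder features is what lets the decoder recover sharp building boundaries that are lost during downsampling.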