Abstract

Automated extraction of buildings from earth observation (EO) data has long been a fundamental but challenging research topic. Combining data from different modalities (e.g., high-resolution imagery (HRI) and light detection and ranging (LiDAR) data) has shown great potential in building extraction. Recent studies have examined the role that deep learning (DL) could play in both multimodal data fusion and urban object extraction. However, DL-based multimodal fusion networks may encounter the following limitations: (1) the individual modal and cross-modal features, which we consider both useful and important for the final prediction, cannot be sufficiently learned and utilized, and (2) the multimodal features are fused by a simple summation or concatenation, which is ambiguous when selecting cross-modal complementary information. In this paper, we address these two limitations by proposing a hybrid attention-aware fusion network (HAFNet) for building extraction. It consists of RGB-specific, digital surface model (DSM)-specific, and cross-modal streams to sufficiently learn and utilize both individual modal and cross-modal features. Furthermore, an attention-aware multimodal fusion block (Att-MFBlock) was introduced to overcome the fusion problem by adaptively selecting and combining complementary features from each modality. Extensive experiments conducted on two publicly available datasets demonstrated the effectiveness of the proposed HAFNet for building extraction.
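The abstract contrasts attention-aware fusion with plain summation or concatenation. The paper does not give implementation details here, so the following NumPy sketch illustrates only the general pattern such a block could follow: squeeze the two modality feature maps into a channel descriptor, pass it through a small gating MLP, and apply a per-channel softmax over the modalities so complementary information is selected adaptively. All shapes, the function name `att_mf_block`, and the weight matrices are illustrative assumptions, not the authors' Att-MFBlock.

```python
import numpy as np

def att_mf_block(rgb_feat, dsm_feat, w1, w2):
    """Illustrative attention-aware fusion of two modality feature maps.

    rgb_feat, dsm_feat: (C, H, W) feature maps from the two streams.
    w1: (hidden, 2C) and w2: (2C, hidden) weights of a small gating MLP
        (hypothetical; in practice these would be learned).
    Returns a fused (C, H, W) map in which a per-channel softmax over the
    two modalities replaces a plain summation or concatenation.
    """
    C = rgb_feat.shape[0]
    # Squeeze: global average pooling gives a (2C,) joint descriptor.
    desc = np.concatenate([rgb_feat.mean(axis=(1, 2)),
                           dsm_feat.mean(axis=(1, 2))])
    # Excitation: tiny MLP produces one logit per (modality, channel) pair.
    hidden = np.maximum(0.0, w1 @ desc)           # ReLU
    logits = (w2 @ hidden).reshape(2, C)
    # Softmax across the modality axis -> adaptive selection weights
    # that sum to 1 for every channel.
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    gates = e / e.sum(axis=0, keepdims=True)      # shape (2, C)
    return (gates[0][:, None, None] * rgb_feat +
            gates[1][:, None, None] * dsm_feat)
```

With untrained (zero) gating weights the softmax assigns equal weight to both modalities, so the block degenerates to an average; training the gates is what lets it favor, say, DSM height cues where spectral evidence is ambiguous.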

Highlights

  • Accurate building information extracted from earth observation (EO) data is essential for a wide range of urban applications, such as three-dimensional modeling, infrastructure planning, and urban expansion analysis

  • We investigate whether the proposed Att-MFBlock can be used to handle the cross-modal fusion ambiguity and learn more discriminative and representative cross-modal features, and how it gains an advantage over other fusion methods at the decision stage

  • The proposed hybrid attention-aware fusion network (HAFNet) was compared with three classical fusion networks in the task of building extraction


Introduction

Accurate building information extracted from earth observation (EO) data is essential for a wide range of urban applications, such as three-dimensional modeling, infrastructure planning, and urban expansion analysis. HRI provides valuable spectral, geometric, and texture information that is useful for distinguishing buildings from non-building objects. However, building extraction from HRI remains challenging due to the large intra-class and low inter-class variation of building objects [1], shadow effects, and the relief displacement of high buildings [2]. Airborne light detection and ranging (LiDAR) technology provides a promising alternative for extracting buildings. LiDAR measurements are not influenced by shadows and offer height information about the land surface, which can help separate buildings from other man-made objects (e.g., roads and squares). However, LiDAR-based building extraction methods are limited by the lack of texture and boundary information [3].
