Urbanization has been a driving force for economic growth, yet it has also given rise to informal urban settlements such as urban villages (UVs), which are characterized by arbitrary land use, high-density construction, and insufficient infrastructure. Previous studies on UV detection have not comprehensively considered the semantic imbalance and feature interaction among cross-modal data, which limits the accuracy and interpretability of their results. In this work, a cross-modal fusion framework is proposed that integrates high-resolution remote sensing and street view images for UV detection. First, ResNet-50 convolutional neural networks extract features from both the remote sensing and the street view images. Then, an inner product channel attention module dynamically adjusts weights while accounting for the multiple viewing angles of the street view images. Finally, a cross-modal feature fusion module that incorporates dilated convolution and a global-based feature fusion block enhances feature interaction and fusion. In a case study of the Guangzhou–Foshan metropolitan area in China, the method achieves an overall accuracy (OA) of 0.975 for UV classification, outperforming a set of baseline methods. Integrating remote sensing and street view images improves the OA by approximately 2%. This work enhances the understanding of the distribution of UVs by combining top-down and ground-level view data in an automatic and efficient way, providing urban planners with valuable insights to accurately identify UVs and to support targeted, sustainable urban renewal aligned with the Sustainable Development Goals (SDGs) for inclusive, resilient cities.
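The idea of weighting multiangle street view features by inner products can be illustrated with a minimal sketch. This is not the inner product channel attention module of the framework, whose details the abstract does not give. It is a simplified stand-in that scores each view by its inner product with the mean feature vector and normalizes the scores with a softmax. The function name `fuse_views`, the scaling by the square root of the feature dimension, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_views(view_features):
    """Hypothetical helper: weight each street view angle by its inner
    product with the mean feature vector, then return the weights and
    the weighted sum of the view features."""
    n = len(view_features)
    dim = len(view_features[0])
    mean = [sum(v[i] for v in view_features) / n for i in range(dim)]
    # Scaled inner-product score per view (the scaling is an assumption).
    scores = [dot(v, mean) / math.sqrt(dim) for v in view_features]
    weights = softmax(scores)
    fused = [
        sum(w * v[i] for w, v in zip(weights, view_features))
        for i in range(dim)
    ]
    return weights, fused


# Toy features for four viewing angles (made-up values, not real data).
views = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.3],
    [0.7, 0.1, 0.2],
]
weights, fused = fuse_views(views)
```

In this toy input, the third view differs most from the others, so it receives the smallest weight, and the fused vector leans toward the views that agree with each other. In a real pipeline the feature vectors would come from a backbone such as ResNet-50 rather than being written by hand.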