The accurate extraction of urban residential space (URS) is of great significance for recognizing the spatial structure of urban function, understanding the complex urban operating system, and scientific allocation and management of urban resources. The traditional URS identification process is generally conducted through statistical analysis or a manual field survey. Currently, there are also superpixel segmentation and wavelet transform (WT) processes to extract urban spatial information, but these methods have shortcomings in extraction efficiency and accuracy. The superpixel wavelet fusion (SWF) method proposed in this paper is a convenient method to extract URS by integrating multi-source data such as Point of Interest (POI) data, Nighttime Light (NTL) data, LandScan (LDS) data, and High-resolution Image (HRI) data. This method fully considers the distribution law of image information in HRI and imparts the spatial information of URS into the WT so as to obtain the recognition results of URS based on multi-source data fusion under the perception of spatial structure. The steps of this study are as follows: Firstly, the SLIC algorithm is used to segment HRI in the Guangdong–Hong Kong–Macao Greater Bay Area (GBA) urban agglomeration. Then, the discrete cosine wavelet transform (DCWT) is applied to POI–NTL, POI–LDS, and POI–NTL–LDS data sets, and the SWF is carried out based on different superpixel scale perspectives. Finally, the OSTU adaptive threshold algorithm is used to extract URS. The results show that the extraction accuracy of the NLT–POI data set is 81.52%, that of the LDS–POI data set is 77.70%, and that of the NLT–LDS–POI data set is 90.40%. The method proposed in this paper not only improves the accuracy of the extraction of URS, but also has good practical value for the optimal layout of residential space and regional planning of urban agglomerations.