Scene classification is a means of interpreting high-resolution remote sensing (HRS) imagery to obtain high-level semantic information, which can provide a reliable reference for urban planning and monitoring. Traditional scene classification methods for HRS imagery take uniform grid cells as the scene units, thereby ignoring geographic boundaries and producing a mosaicking effect. Therefore, in this paper, the urban scene is defined as a geographic unit bounded by the road network, and the task of urban scene understanding is to recognize the socioeconomic or natural semantics of each such unit. However, due to the complexity of the urban environment, the effectiveness of traditional scene classification methods is limited by three problems when they are applied to urban scenes: 1) HRS images lack socioeconomic information, which makes it difficult to discriminate between urban scenes of different categories that have similar exteriors. 2) The large variation in the sizes and shapes of urban land parcels hampers scene feature extraction and representation. 3) Urban scene understanding frameworks that can embed various scene classification models have rarely been studied. To solve these problems, a universal urban scene understanding framework based on multi-source geographic data (USUMG) is proposed in this paper. In the USUMG framework, road network and water channel data from OpenStreetMap (OSM) are used to generate the urban scene units. For each irregular unit, a scene decomposition method based on the morphological skeleton is employed to represent the unit by a set of uniform processing patches. To integrate the different data sources, the high-level semantic features extracted from the HRS imagery and the socioeconomic features extracted from point of interest (POI) data are fused to determine the urban scene category. 
Finally, the USUMG framework was tested with various scene classification methods in urban districts of Wuhan and Macao, China, to verify its universality and feasibility. The experimental results are provided in this paper as a benchmark for urban scene understanding based on multi-source geographic data.
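
To make the fusion step described above more concrete, the following minimal Python sketch shows one simple way in which per-patch image features and POI data could be combined into a single descriptor for one scene unit. It is an illustrative assumption, not the fusion scheme of the USUMG framework itself: the function names `poi_histogram` and `fuse_features`, the mean pooling over patches, and the concatenation with a normalized POI category histogram are all hypothetical choices made for this example.

```python
import numpy as np


def poi_histogram(poi_categories, num_categories):
    """Normalized frequency of POI category ids inside one scene unit.

    Assumes (hypothetically) that each POI is encoded as an integer id
    in the range [0, num_categories).
    """
    ids = np.asarray(poi_categories, dtype=np.int64)
    counts = np.bincount(ids, minlength=num_categories).astype(np.float64)
    total = counts.sum()
    # A unit without any POI yields an all-zero socioeconomic feature.
    return counts / total if total > 0 else counts


def fuse_features(patch_features, poi_categories, num_categories):
    """Fuse image and POI information for one scene unit.

    patch_features: array of shape (num_patches, feature_dim), e.g. the
    features of the processing patches obtained from one irregular unit.
    The patches are mean-pooled (an assumption of this sketch) and the
    result is concatenated with the POI histogram.
    """
    patch_features = np.asarray(patch_features, dtype=np.float64)
    image_feature = patch_features.mean(axis=0)
    poi_feature = poi_histogram(poi_categories, num_categories)
    return np.concatenate([image_feature, poi_feature])


# Usage: two patches with 2-D features, four POIs over three categories.
patch_features = np.array([[0.2, 0.4], [0.6, 0.8]])
fused = fuse_features(patch_features, [0, 2, 2, 1], num_categories=3)
print(fused.shape)
```

The fused vector could then be passed to any classifier to predict the scene category. Under this assumption, the choice of classifier is independent of the fusion, in keeping with a framework into which different scene classification models can be embedded.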