Urban scenes refer to city blocks which are basic units of megacities, they play an important role in citizens’ welfare and city management. Remote sensing imagery with largescale coverage and accurate target descriptions, has been regarded as an ideal solution for monitoring the urban environment. However, due to the heterogeneity of remote sensing images, it is difficult to access their geographical content at the object level, let alone understanding urban scenes at the block level. Recently, deep learning-based strategies have been applied to interpret urban scenes with remarkable accuracies. However, the deep neural networks require a substantial number of training samples which are hard to satisfy, especially for high-resolution images. Meanwhile, the crowed-sourced Open Street Map (OSM) data provides rich annotation information about the urban targets but may encounter the problem of insufficient sampling (limited by the places where people can go). As a result, the combination of OSM and remote sensing images for efficient urban scene recognition is urgently needed. In this paper, we present a novel strategy to transfer existing OSM data to high-resolution images for semantic element determination and urban scene understanding. To be specific, the object-based convolutional neural network (OCNN) can be utilized for geographical object detection by feeding it rich semantic elements derived from OSM data. Then, geographical objects are further delineated into their functional labels by integrating points of interest (POIs), which contain rich semantic terms, such as commercial or educational labels. Lastly, the categories of urban scenes are easily acquired from the semantic objects inside. Experimental results indicate that the proposed method has an ability to classify complex urban scenes. The classification accuracies of the Beijing dataset are as high as 91% at the object-level and 88% at the scene level. Additionally, we are probably the first to investigate the object level semantic mapping by incorporating high-resolution images and OSM data of urban areas. Consequently, the method presented is effective in delineating urban scenes that could further boost urban environment monitoring and planning with high-resolution images.