Digitalization of large-scale urban scenes, and of buildings in particular, has been a long-standing open problem, largely owing to challenges in data acquisition such as incomplete scene coverage, lack of semantics, low efficiency, and low reliability in path planning. In this paper, we address these challenges for urban building reconstruction from aerial images, proposing an effective workflow and several novel algorithms for efficient reconstruction of 3D building instance proxies in large urban scenes. Specifically, we propose a novel learning-based approach to instance segmentation of urban buildings in aerial images, followed by a voting-based algorithm that fuses the multi-view instance information onto a sparse point cloud (reconstructed using a standard Structure from Motion pipeline). This enables effective segmentation of individual building instances in the point cloud. We also introduce a layer-based surface reconstruction method dedicated to reconstructing 3D building proxies from extremely sparse point clouds. Extensive experiments on both synthetic and real-world aerial images of large urban scenes demonstrate the effectiveness of our approach. The generated scene proxy models already provide a promising 3D surface representation of the buildings in large urban scenes. When applied to aerial path planning, the instance-enhanced building proxy models significantly improve data completeness and accuracy, yielding highly detailed 3D building models.
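To make the multi-view fusion step more concrete, the following is a minimal sketch of how per-view instance labels could be voted onto sparse 3D points. It is an illustration under stated assumptions, not the algorithm proposed in this paper: it assumes each sparse point carries a list of its 2D observations (as in a typical Structure from Motion track), that each view has an integer instance mask, and that instance IDs are already consistent across views. The names `vote_point_labels`, `observations`, `masks`, and `min_votes` are hypothetical.

```python
from collections import Counter


def vote_point_labels(observations, masks, background=0, min_votes=2):
    """Assign each sparse 3D point the instance ID that most views agree on.

    observations: dict mapping point_id -> list of (view_id, row, col)
                  pixel observations of that point (hypothetical layout).
    masks:        dict mapping view_id -> 2D list of integer instance IDs,
                  with `background` marking non-building pixels.
                  Assumption: IDs are already consistent across views.
    min_votes:    minimum number of agreeing views needed to accept a label.
    """
    labels = {}
    for point_id, obs in observations.items():
        votes = Counter()
        for view_id, row, col in obs:
            mask = masks[view_id]
            # Ignore observations that fall outside the mask bounds.
            if 0 <= row < len(mask) and 0 <= col < len(mask[0]):
                label = mask[row][col]
                if label != background:
                    votes[label] += 1
        if votes:
            label, count = votes.most_common(1)[0]
            # Reject weakly supported labels as background.
            labels[point_id] = label if count >= min_votes else background
        else:
            labels[point_id] = background
    return labels


# Toy example: three 2x2 instance masks and three sparse points.
masks = {
    0: [[1, 1], [0, 2]],
    1: [[1, 0], [2, 2]],
    2: [[1, 2], [2, 2]],
}
observations = {
    "a": [(0, 0, 0), (1, 0, 0), (2, 0, 1)],
    "b": [(0, 1, 0), (1, 0, 1)],
    "c": [(0, 1, 1), (1, 1, 1), (2, 5, 5)],
}
point_labels = vote_point_labels(observations, masks)
```

In this sketch, requiring `min_votes` agreeing views is one simple way to suppress labels supported by a single, possibly erroneous, segmentation; a real pipeline would also need to associate instance IDs across views before voting.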