Calibration of an urban-scale building energy model is crucial for advancing building energy efficiency codes and policies to reduce carbon dioxide emissions whereas the accuracy of the model is challenged by performance gaps resulting from sparse data and uncertainties in modeling. An empirical Bayesian (EB) approach, incorporated a probability-based mixed-effects energy model and Monte Carlo Markov chain simulation, was developed to predict and benchmark building energy use intensity (EUI) for space heating, using the metering data of 2020 stochastically sampled district heating substations from a municipal utility database. The limitations, implementations and performances of different modeling strategy, including EB as well as Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP) and Fully Bayes (FB) were compared. The results show that EB strategy achieved superior prediction performance than other strategies on the novel test dataset. The preparation, calibration, and validation efforts for implementing the EB approach were demonstrated. The proposed methodology is recommended as a robust approach for benchmarking urban-scale building EUI, distinguished from other machine learning approaches by requiring fewer training samples, better interpreting uncertainties in the data used, and exhibiting superior generalization capabilities. The proposed methodology was successfully demonstrated through the urban-scale benchmarking of space heating EUI in Beijing.