Commercial building energy benchmarking has been used to evaluate the energy use of a single building over time, relative to other similar buildings, or relative to simulations of a reference building that conforms to various energy standards. A lack of empirical demand flexibility data and of consistent flexibility metrics has limited the ability to compare measured demand flexibility performance with estimated demand flexibility in buildings. In this study, we collected demand response performance data for 831 demand response events at 192 sites as a first step toward building such a demand flexibility dataset, and we propose a standard core data schema to consolidate field data from different sources. We also performed parametric simulations of a control strategy called “global temperature adjustment” using commercial office prototype building models. We then compared the simulated demand flexibility performance against the measured data for offices that implemented the global temperature adjustment strategy. During demand response events with an average outside air temperature of 34 °C (range 23 °C–42 °C), the measured demand decrease intensity, one of the demand flexibility metrics, was 6.1 watts per square meter (W/m²), 10.0 W/m², 11.1 W/m², 7.1 W/m², and 4.7 W/m² for small, small–medium, medium, medium–large, and large office buildings, respectively. Compared to the measured data in medium- and large-size buildings, the simulated demand decrease intensity was 0.7 W/m² (17%) lower on average. The discrepancy between simulated and measured peak demand intensities fell within one standard deviation of the mean measured data. These comparison results support the credibility of simulations in reproducing real building data when assessing the technical potential of building demand flexibility.
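As an illustration of the kind of metric reported above, the following minimal sketch computes a demand decrease intensity and a relative simulated-versus-measured difference. It assumes that the intensity is the mean demand reduction against a baseline during an event, normalized by floor area; that definition, the function names, and the input values are illustrative assumptions and are not taken from the study's dataset or schema.

```python
def demand_decrease_intensity(baseline_kw, event_kw, floor_area_m2):
    """Mean demand reduction during an event, per unit floor area (W/m2).

    Assumed definition (hypothetical, for illustration only):
    average of (baseline - event) over the event intervals, divided by area.
    """
    if not baseline_kw or len(baseline_kw) != len(event_kw):
        raise ValueError("baseline and event series must be non-empty and equal in length")
    if floor_area_m2 <= 0:
        raise ValueError("floor area must be positive")
    reductions_kw = [b - e for b, e in zip(baseline_kw, event_kw)]
    mean_reduction_kw = sum(reductions_kw) / len(reductions_kw)
    return mean_reduction_kw * 1000.0 / floor_area_m2  # kW -> W, then per m2


def relative_difference(simulated, measured):
    """Signed difference of simulated from measured, as a fraction of measured."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return (simulated - measured) / measured


# Made-up hourly demand values (kW) for a hypothetical 2000 m2 office.
baseline = [100.0, 110.0, 105.0]
event = [90.0, 98.0, 97.0]
measured_intensity = demand_decrease_intensity(baseline, event, 2000.0)
print(measured_intensity)                            # 5.0 (W/m2)
print(relative_difference(4.0, measured_intensity))  # -0.2, i.e. 20% lower
```

Normalizing by floor area is what allows buildings of different sizes to be compared on the same scale, and the signed relative difference makes it explicit whether a simulation under- or over-predicts the measured value.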