Abstract. An increasing number of climate model simulations is becoming available for the transition from the Last Glacial Maximum to the Holocene. Assessing the simulations' reliability requires benchmarking against environmental proxy records. To date, no established method exists to compare these two data sources in space and time over a period with changing background conditions. Here, we develop a new algorithm to rank simulations according to their deviation from reconstructed magnitudes and temporal patterns of orbital and millennial-scale temperature variations. The use of proxy forward modeling allows for accounting for non-climatic processes that affect the temperature reconstructions. It further avoids the need to reconstruct gridded fields or regional mean temperature time series from sparse and uncertain proxy data. First, we test the reliability and robustness of our algorithm in idealized experiments with prescribed deglacial temperature histories. We quantify the influence of limited temporal resolution, chronological uncertainties, and non-climatic processes by constructing noisy pseudo-proxies. While model–data comparison results become less reliable with increasing uncertainties, we find that the algorithm discriminates well between simulations under realistic non-climatic noise levels. To obtain reliable and robust rankings, we advise spatial averaging of the results for individual proxy records. Second, we demonstrate our method by quantifying the deviations between an ensemble of transient deglacial simulations and a global compilation of sea surface temperature reconstructions. The ranking of the simulations differs substantially between the considered regions and timescales, which suggests that optimizing for agreement with the temporal patterns of a small set of proxies might be insufficient for capturing the spatial structure of the deglacial temperature variability. We attribute the diversity in the rankings to more regionally confined temperature variations in reconstructions than in simulations, which could be the result of uncertainties in boundary conditions, shortcomings in models, or regionally varying characteristics of reconstructions such as recording seasons and depths. Future work towards disentangling these potential reasons can leverage the flexible design of our algorithm and its demonstrated ability to identify varying levels of model–data agreement. Additionally, the algorithm can be applied to variables like oxygen isotopes and climate transitions such as the penultimate deglaciation and the last glacial inception.