Benchmarks are an integral part of machine learning development, yet the most common benchmarks share several limitations. For example, the difference in performance between two models has no probabilistic interpretation, such differences cannot be meaningfully compared across data sets, and there is no reference point that indicates a significant performance improvement. Here we introduce an Elo-based predictive power meta-score that is built on top of other performance measures and allows for interpretable comparisons of models. Differences in this score have a probabilistic interpretation and can be compared directly between data sets. Furthermore, the meta-score allows for an assessment of ranking fitness. We prove the properties of the Elo-based predictive power meta-score and support them with empirical results on a large-scale benchmark of 30 classification data sets. Additionally, we propose a unified benchmark ontology that provides a uniform description of benchmarks.

The performance of machine learning models is usually compared via the mean value of a selected performance measure, such as the area under the receiver operating characteristic curve, on a specific benchmark data set. However, this measure, its mean value, or even relative differences between models do not provide a good indication of whether the results will translate to other data sets. Gosiewska and colleagues present here a comparison based on Elo ranking, which offers a probabilistic interpretation of how much better one model is than another.
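To make the probabilistic interpretation concrete, the following minimal sketch assumes a logistic (Elo-style) link between a difference in meta-scores and the probability that one model outperforms the other in a head-to-head comparison; the exact scaling used by the meta-score in the paper may differ, and the function and variable names here are illustrative only.

```python
import math


def win_probability(score_a: float, score_b: float) -> float:
    """Probability that model A beats model B in a single comparison,
    under the assumed logistic (Elo-style) link:
    P(A beats B) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Under this assumed link, a 0.5-point advantage in the meta-score maps to
# roughly a 62% chance that model A outperforms model B, regardless of the
# data set on which the scores were estimated.
print(win_probability(score_a=1.3, score_b=0.8))  # ~0.62
```

Because the difference in scores, rather than either score alone, determines the win probability, the same gap carries the same interpretation on any data set, which is the property the abstract highlights.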