Abstract

This thesis addresses the evaluation methods used to measure the performance of machine learning algorithms. In supervised learning, algorithms are designed to perform common learning tasks, including classification, ranking, scoring, and probability estimation. This work investigates how the information produced by these learning tasks can be exploited by performance evaluation measures. In the literature, researchers recommend evaluating classification and ranking tasks using the Receiver Operating Characteristic (ROC) curve. In a scoring task, the learning model estimates scores from the training data and assigns them to the testing data; these scores are used to express class membership. Sometimes the scores represent probabilities, in which case the Mean Squared Error (Brier score) is used to measure their quality. If the scores are not probabilities, however, the task is reduced to a ranking or classification task by ignoring them, and the standard ROC curve likewise excludes such scores from its analysis. We claim that treating non-probabilistic scores as probabilities is often incorrect, and that doing so properly would require imposing additional assumptions on the algorithm or the data. Ignoring these scores entirely is also problematic: although they may provide poor probability estimates, their magnitudes nonetheless carry information that can be valuable for performance analysis. The purpose of this dissertation is to propose a novel method that extends the ROC curve to include such scores; we therefore call it the scored ROC curve. In particular, we develop a method to construct a scored ROC curve, demonstrate how to reduce it to a standard ROC curve, and illustrate how it can be used to compare learning models. Our experiments demonstrate that the scored ROC curve captures both similarities and differences in the performance of different learning models, and is more sensitive to them than the standard ROC curve. In addition, we illustrate our method's ability to detect changes in the data distribution between training and testing.
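The abstract contrasts two background notions: the standard ROC curve, which uses only the ranking induced by the scores, and the Brier score, which treats the scores as probabilities. The sketch below illustrates that contrast only; it is not the thesis's scored ROC method. The function names and toy data are illustrative assumptions.

```python
# Background sketch (not the scored ROC curve proposed in the thesis):
# the standard ROC curve depends only on the ordering of the scores,
# while the Brier score depends on their magnitudes as probabilities.
import numpy as np

def roc_points(labels, scores):
    """Return (fpr, tpr) points of the standard ROC step curve.

    Only the ordering of `scores` matters: any monotone transformation
    of the scores yields exactly the same curve.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)                     # descending score order
    labels = labels[order]
    pos = labels.sum()
    neg = len(labels) - pos
    tpr = np.concatenate(([0.0], np.cumsum(labels) / pos))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / neg))
    return fpr, tpr

def brier_score(labels, probs):
    """Mean squared error between labels and predicted probabilities."""
    labels = np.asarray(labels, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.mean((probs - labels) ** 2))

# Toy example: two score vectors with the same ranking give the same
# ROC curve but different Brier scores -- the information gap that
# motivates extending the ROC curve with score magnitudes.
y  = [1, 0, 1, 1, 0]
s1 = [0.9, 0.4, 0.8, 0.7, 0.2]
s2 = [0.6, 0.4, 0.55, 0.5, 0.2]                     # same ordering, different magnitudes
print(np.allclose(roc_points(y, s1), roc_points(y, s2)))  # True: identical ROC curves
print(brier_score(y, s1), brier_score(y, s2))             # different Brier scores
```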
