Abstract

Software defect prediction models are classifiers often built by setting a threshold t on a defect proneness model, i.e., a scoring function. For instance, they classify a software module as non-faulty if its defect proneness is below t and as faulty otherwise. Different values of t may lead to different defect prediction models, possibly with very different performance levels. Receiver Operating Characteristic (ROC) curves provide an overall assessment of a defect proneness model, by taking into account all possible values of t and thus all defect prediction models that can be built based on it. However, using a defect proneness model with a given value of t is sensible only if the resulting defect prediction model performs at least as well as some minimal performance level that depends on practitioners’ and researchers’ goals and needs. We introduce a new approach and a new performance metric (the Ratio of Relevant Areas, RRA) for assessing a defect proneness model by taking into account only the parts of a ROC curve corresponding to values of t for which the resulting defect prediction models have higher performance than some reference value. We provide the practical motivations and theoretical underpinnings for our approach, by: 1) showing how it addresses the shortcomings of existing performance metrics like the Area Under the Curve (AUC) and Gini’s coefficient; 2) deriving reference values based on random defect prediction policies, in addition to deterministic ones; 3) showing how the approach works with several performance metrics (e.g., Precision and Recall) and their combinations; 4) studying misclassification costs and providing a general upper bound for the cost related to the use of any defect proneness model; 5) showing the relationships between misclassification costs and performance metrics. We also carried out a comprehensive empirical study on real-life data from the SEACRAFT repository, to show the differences between our metric and existing ones and how much more reliable and less misleading our metric can be.
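To make the threshold mechanism concrete, here is a minimal Python sketch, on synthetic data, of how a defect proneness score is turned into defect prediction models for varying values of t, how the resulting ROC curve and AUC are computed, and how the computation could be restricted to "relevant" thresholds. The relevance criterion used here (Precision above the base rate of faulty modules) is a hypothetical stand-in for illustration only; the paper's RRA metric is defined precisely in the full text.

    # Minimal sketch (not the paper's exact formulation): threshold a defect
    # proneness score to obtain defect prediction models, trace the ROC curve,
    # and compare the full AUC with a restriction to "relevant" thresholds.
    import numpy as np

    def roc_points(scores, labels, thresholds):
        """For each t, predict faulty iff score >= t; return (FPR, TPR, Precision)."""
        pos, neg = labels.sum(), len(labels) - labels.sum()
        pts = []
        for t in thresholds:
            pred = scores >= t
            tp = np.sum(pred & (labels == 1))
            fp = np.sum(pred & (labels == 0))
            # Convention: Precision is 1.0 when nothing is predicted faulty.
            prec = tp / pred.sum() if pred.sum() else 1.0
            pts.append((fp / neg, tp / pos, prec))
        return pts

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, 200)                # 1 = faulty module (synthetic)
    scores = 0.3 * labels + 0.7 * rng.random(200)   # noisy defect proneness score

    # Sweep thresholds from above the maximum score (nothing predicted faulty)
    # down to the minimum (everything predicted faulty).
    ts = np.concatenate(([scores.max() + 1], np.sort(np.unique(scores))[::-1]))
    pts = roc_points(scores, labels, ts)
    fpr = np.array([p[0] for p in pts])
    tpr = np.array([p[1] for p in pts])

    # Full AUC via the trapezoidal rule: every threshold counts, relevant or not.
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

    # Hypothetical relevance filter: keep only thresholds whose model beats the
    # Precision of a trivial policy matching the base rate of faulty modules.
    base_rate = labels.mean()
    relevant = [p for p in pts if p[2] > base_rate]
    print(f"AUC = {auc:.3f}; {len(relevant)}/{len(pts)} thresholds deemed relevant")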

Highlights

  • Accurate estimation of which modules are faulty in a software system can be very useful to software practitioners and researchers

  • Several techniques have been proposed and applied in the literature for estimating whether a module is faulty (Beecham et al. 2010a, b; Hall et al. 2012; Malhotra 2015; Radjenovic et al. 2013). We focus on those techniques that define defect prediction models (i.e., binary classifiers (Fawcett 2006)) by setting a threshold t on a defect proneness model (Huang et al. 2019), i.e., a scoring classifier that uses a set of independent variables

  • We show that the Area Under the Curve (AUC) and Gini’s coefficient (G) (Gini 1912), as well as other proposals, are special cases of the Ratio of Relevant Areas (RRA) in which even the parts of the Receiver Operating Characteristic (ROC) curve corresponding to thresholds for which it is not worthwhile to build defect prediction models are taken into account (the standard relationship between AUC and G is recalled below)
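For context, a well-known relationship from the ROC literature links the two overall metrics mentioned in this highlight: Gini’s coefficient is an affine rescaling of AUC,

    G = 2 · AUC − 1

so a random classifier (AUC = 0.5) has G = 0 and a perfect one (AUC = 1) has G = 1. Both metrics weight all parts of the ROC curve, whether or not the corresponding thresholds are worth using.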


Introduction

Accurate estimation of which modules are faulty in a software system can be very useful to software practitioners and researchers. Researchers need quantitative, accurate module defect prediction techniques so that they can assess and subsequently improve software development methods. We focus on those techniques that define defect prediction models (i.e., binary classifiers (Fawcett 2006)) by setting a threshold t on a defect proneness model (Huang et al. 2019), i.e., a scoring classifier that uses a set of independent variables. If the defect proneness model computes the probability that a module is faulty, a defect prediction model classifies a module as faulty if its probability of being faulty is greater than or equal to t. The issue of defining the value of t has been addressed by several approaches in the literature (for instance, Alves et al. (2010), Erni and Lewerentz (1996), Morasca and Lavazza (2017), Schneidewind (2001), Shatnawi (2010), and Tosun and Bener (2009)).
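As a concrete illustration, the following Python sketch (with hypothetical scores and ground truth) shows how different choices of t over the same defect proneness scores yield defect prediction models with very different Precision and Recall:

    # Hypothetical defect proneness scores (estimated probabilities of being
    # faulty) and ground truth for six modules (1 = faulty).
    p_faulty = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
    actual   = [1,   1,   0,   1,   0,   0]

    for t in (0.25, 0.50, 0.75):
        pred = [int(p >= t) for p in p_faulty]      # faulty iff probability >= t
        tp = sum(pr and ac for pr, ac in zip(pred, actual))
        precision = tp / sum(pred) if sum(pred) else float("nan")
        recall = tp / sum(actual)
        print(f"t={t:.2f}: Precision={precision:.2f}, Recall={recall:.2f}")

Here, raising t from 0.25 to 0.75 increases Precision from 0.60 to 1.00 while Recall drops from 1.00 to 0.67, which is why the choice of t, and hence the set of thresholds actually worth considering, matters.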

