The application of ROC analysis in threshold identification, data imbalance and metrics selection for software fault prediction

Raed Shatnawi

doi:10.1007/s11334-017-0295-0

Abstract

Software engineers have limited resources and need metrics analysis tools to investigate software quality such as fault-proneness of modules. There are a large number of software metrics available to investigate quality. However, not all metrics are strongly correlated with faults. In addition, software fault data are imbalanced and affect quality assessment tools such as fault prediction or threshold values that are used to identify risky modules. Software quality is investigated for three purposes. First, the receiver operating characteristics (ROC) analysis is used to identify threshold values to identify risky modules. Second, the ROC analysis is investigated for imbalanced data. Third, the ROC analysis is considered for feature selection. This work validated the use of ROC to identify thresholds for four metrics (WMC, CBO, RFC and LCOM). The ROC results after sampling the data are not significantly different from before sampling. The ROC analysis selects the same metrics (WMC, CBO and RFC) in most datasets, while other techniques have a large variation in selecting metrics.

Full Text