Abstract

Software fault prediction has been widely studied to assess software quality and to predict where faults are likely to appear in the future. However, prediction performance degrades for several reasons, including unlabeled instances and data imbalance, i.e., modules that contain faults are a minority. Data imbalance is common in fault data, where the majority of software modules are marked as non-faulty. However, some of these modules are still fault-prone; their faults have simply not been uncovered yet. Threshold values are used to identify modules that are complex and more fault-prone, and fault prediction models are combined with these threshold values to improve prediction performance. The fault prediction models are built in two phases. First, threshold values are used to spot the most fault-prone modules: fault-free modules whose metrics exceed the thresholds are classified as medium, while modules with faults are classified as high. Second, the relabeled data are used to build prediction models with five machine learning classifiers. Five classifiers were built for ten software systems, and we found improvements in the classification performance of all classifiers compared with traditional classification.
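The phase-one relabeling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the metric names, threshold values, and the assumption of a three-level scheme (a "low" class for fault-free modules below all thresholds) are hypothetical, inferred from the abstract.

```python
# Hypothetical metric thresholds (e.g. cyclomatic complexity, lines of code);
# the actual metrics and values would come from the paper's threshold analysis.
THRESHOLDS = {"cyclomatic_complexity": 10, "loc": 100}

def relabel(module):
    """Assign a fault-proneness class from fault data and metric thresholds."""
    if module["faults"] > 0:
        return "high"    # modules with recorded faults -> high
    if any(module[m] > t for m, t in THRESHOLDS.items()):
        return "medium"  # fault-free, but a metric exceeds its threshold
    return "low"         # fault-free and below all thresholds (assumed class)

modules = [
    {"cyclomatic_complexity": 3,  "loc": 40, "faults": 0},
    {"cyclomatic_complexity": 15, "loc": 80, "faults": 0},
    {"cyclomatic_complexity": 7,  "loc": 60, "faults": 2},
]
labels = [relabel(m) for m in modules]
```

The relabeled data would then serve as the training set for the five machine learning classifiers in phase two.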
