Abstract

Software practitioners continue to build advanced software defect prediction (SDP) models to help testers find fault-prone modules. However, defect datasets suffer from the class imbalance (CI) problem: they contain uncommonly few defective instances and many more non-defective ones, which causes inconsistent model performance. We conducted 880 experiments to analyze how class imbalance affects the performance of 10 SDP models. Our experiments used 22 public datasets comprising 41 software metrics, 10 baseline SDP methods, and 4 sampling techniques. We evaluated the models with the Matthews Correlation Coefficient (MCC), which is particularly informative when a dataset is highly imbalanced, and we compared the predictive performance of the ML models under each of the 4 sampling techniques. To examine the performance of the different SDP models, we also used the F-measure. We found the performance of the learning models unsatisfactory, which calls for mitigation. We also observed a few surprising results and some logical patterns between classifiers and sampling techniques, providing a connection between sampling technique, software metrics, and classifier.
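The two evaluation measures named above can be computed directly from a confusion matrix. The sketch below, with hypothetical counts chosen to mimic an imbalanced defect dataset (15 defective modules out of 100), shows why MCC is informative under imbalance: it uses all four cells of the confusion matrix, including true negatives, which the F-measure ignores.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; ranges from -1 to +1."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f_measure(tp, fp, fn):
    """F-measure (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for an imbalanced defect dataset:
# 15 defective modules out of 100 (10 caught, 5 missed), 5 false alarms.
print(round(mcc(tp=10, tn=80, fp=5, fn=5), 3))   # 0.608
print(round(f_measure(tp=10, fp=5, fn=5), 3))    # 0.667
```

In practice, implementations such as `sklearn.metrics.matthews_corrcoef` and `sklearn.metrics.f1_score` compute the same quantities from predicted and true labels.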
