An Empirical Study on Data Sampling Methods in Addressing Class Imbalance Problem in Software Defect Prediction

Babajide J Odejide, Kayode S Adewole, Hammed A Mojeed, Abdullateef O Balogun, Shakirat A Salihu, Fatima E Usman-Hamza, Zubair O Alanamu, Amos O Bajeh, Abimbola G Akintola

doi:10.1007/978-3-031-09070-7_49

Abstract

With the growing number of software systems and their applications in diverse walks of life, the importance of developing defect-free software cannot be overemphasized. Detecting software defects is one of the most prominent difficulties in software engineering (SE) and the software development process. Defects are usually unintentional flaws that make a software system behave unexpectedly or contrary to its specified requirements. This makes software defect prediction (SDP) a critical subject. Due to their dynamism, SDP solutions based on machine learning (ML) methods are regarded as a viable approach. However, latent data quality problems are a significant challenge to developing effective SDP models. Class imbalance is a classic example of a data quality problem, in which there is a large difference between the number of majority-class and minority-class instances. Findings from prior studies have shown that data sampling methods are capable of addressing the class imbalance problem. Hence, this study conducts an empirical comparative analysis of the effect of data sampling methods in addressing the class imbalance problem inherent in SDP. Specifically, the performance of five data sampling methods, comprising oversampling techniques (SMOTE, ADASYN, and ROS) and undersampling techniques (RUS and NM), is investigated on four software defect datasets with varying granularities. Decision tree (DT) and random forest (RF) classifiers are deployed as prediction models. The predictive performance of the developed models was evaluated using accuracy, the area under the curve (AUC), and the Matthews correlation coefficient (MCC). The experimental results showed that introducing data sampling methods into SDP processes not only addresses the class imbalance problem but also improves the prediction performance of the studied classifiers.
In addition, models based on ROS-resampled datasets had superior predictive performance compared with models based on datasets resampled by the other studied methods. In conclusion, it is recommended to deploy data sampling methods, particularly oversampling methods, in SDP processes and other applicable machine learning tasks.

Keywords

Software defect prediction, Class imbalance, Data sampling, Machine learning
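
To make two of the ideas named in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of random oversampling (ROS) and the Matthews correlation coefficient (MCC) in plain Python. The function names `random_oversample` and `mcc`, the `seed` parameter, and the toy data are assumptions introduced purely for illustration.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Random oversampling (ROS): duplicate randomly chosen rows of each
    smaller class until every class matches the largest class in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original rows that carry this label.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels.
    Returns 0.0 when the denominator is zero (a common convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: 8 non-defective modules (0) and 2 defective (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)  # both classes now have 8 rows
```

In practice, resampling is usually applied to the training split only, so that duplicated minority rows do not leak into the evaluation data.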

Full Text