Cross-project defect prediction using data sampling for class imbalance learning: an empirical study

Lipika Goel, Mayank Sharma, Sunil Kumar Khatri, D Damodaran

doi:10.1080/17445760.2019.1650039

Abstract

The availability of defect data from different projects has made cross-project defect prediction an open research issue in software engineering. In cross-project defect prediction, the source and target projects are different: the prediction model is trained on data from the source projects and then tested on the target project's data. Drawing data from varying projects leads to a highly imbalanced source dataset, and the performance of the predictive model degrades because of this imbalance. This is known as the class imbalance problem in machine learning. This paper conducts a two-part empirical analysis. First, it evaluates whether data sampling techniques can handle the class imbalance problem and improve the performance of the predictive model for cross-project defect prediction (CPDP). Second, it evaluates whether the results of CPDP after data sampling are comparable to those of within-project defect prediction (WPDP). Ensemble learning classifiers are used as the predictive models over 12 publicly available object-oriented project datasets. The experimental results indicate that SMOTE oversampling can be applied to overcome the class imbalance problem in CPDP. After SMOTE oversampling, CPDP also gives results comparable to WPDP, with statistical significance.
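
As an illustration of the oversampling idea named in the abstract, the following is a minimal sketch of SMOTE-style interpolation applied to source-project training data. It is not the authors' implementation: the function names `smote` and `balance_source`, the label convention (1 = defective, assumed to be the minority class), the choice to oversample until both classes are equal in size, and the toy metric vectors are all assumptions made for this example. Only the source (training) data is resampled here; the target project's data would be left untouched for testing. In practice a library implementation, such as the `SMOTE` class in imbalanced-learn, would normally be used instead.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance_source(X, y, k=5, seed=0):
    """Oversample label 1 (assumed minority) until both classes are equal in size."""
    minority = [x for x, label in zip(X, y) if label == 1]
    n_majority = sum(1 for label in y if label == 0)
    n_new = max(0, n_majority - len(minority))
    new_rows = smote(minority, n_new, k=k, seed=seed) if n_new else []
    return X + new_rows, y + [1] * len(new_rows)


# Toy source-project data: made-up metric vectors, not taken from the paper.
X_source = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.9], [2.1, 2.4],
            [8.0, 9.0], [9.0, 8.5]]
y_source = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = balance_source(X_source, y_source)
print(sum(y_bal), len(y_bal) - sum(y_bal))  # defective vs. non-defective counts
```

Each synthetic sample lies on the line segment between a defective module and one of its nearest defective neighbours, so the minority class grows without simply duplicating existing rows.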

Full Text