Abstract

Software defect prediction currently suffers from the class-imbalance problem: traditional methods are insensitive to defect-prone modules and tend to predict them as defect-free. To deal with this problem, sampling techniques are adopted to rebalance the defect-prone and defect-free data used to train the predictive model and thereby improve its performance. However, the combined effect of sampling techniques and machine learning classifiers on the performance of software defect prediction remains unclear. The intent of this paper is to study the performance impact that different combinations of sampling techniques and machine learning classifiers have on defect prediction. Specifically, we investigate three sampling techniques, namely resampling, spread subsampling, and SMOTE (Synthetic Minority Over-sampling Technique), and five machine learning classifiers, namely C4.5, naive Bayes, logistic regression, support vector machine, and deep learning, to study their combined effect on defect prediction. Using the Friedman test and the Nemenyi test, we find that no single method is optimal among all the 12 combinations for defect prediction. However, support vector machine and deep learning produce the best performance stably across all the investigated projects. With ANOVA analysis, we find that the sampling techniques have a great impact on the outcomes of defect prediction because they produce different data distributions for model training. Moreover, the sampling proportion has a significant impact on TPR (True Positive Rate) and FPR (False Positive Rate), whereas it only influences the AUC (Area Under the Curve) and Balance of logistic regression. We explain the experimental results in the paper.
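
To illustrate the kind of sampling-plus-classifier combination the study evaluates, the sketch below pairs SMOTE (from imbalanced-learn) with a support vector machine (from scikit-learn) and computes the four metrics named above. This is a minimal, hypothetical example, not the paper's implementation: the `evaluate` function name, the 80/20 stratified split, and the default hyperparameters are assumptions made for illustration only.

```python
# Illustrative sketch: one sampling technique (SMOTE) combined with one
# classifier (SVM), evaluated with TPR, FPR, AUC, and Balance.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(X, y, random_state=0):
    # Hypothetical 80/20 stratified split; the paper's protocol may differ.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state)

    # Rebalance only the training data so the test distribution stays untouched.
    X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)

    clf = SVC(probability=True, random_state=random_state).fit(X_bal, y_bal)
    scores = clf.predict_proba(X_te)[:, 1]
    preds = clf.predict(X_te)

    tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
    tpr = tp / (tp + fn)               # True Positive Rate (defect-prone recall)
    fpr = fp / (fp + tn)               # False Positive Rate
    auc = roc_auc_score(y_te, scores)  # Area Under the ROC Curve
    # Balance: distance from the ideal point (FPR=0, TPR=1), rescaled to [0, 1].
    balance = 1 - np.sqrt(((0 - fpr) ** 2 + (1 - tpr) ** 2) / 2)
    return {"TPR": tpr, "FPR": fpr, "AUC": auc, "Balance": balance}
```

Swapping in a different sampler (e.g., random under- or over-sampling) or classifier yields the other combinations the study compares under the same evaluation scheme.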
