Abstract

Software defect prediction currently suffers from the class-imbalance problem: traditional methods are insensitive to defect-prone modules and tend to predict them as defect-free. To deal with this problem, sampling techniques are adopted to rebalance the defect-prone and defect-free data used to train the predictive model and thereby improve its performance. However, the combined effect of sampling techniques and machine learning classifiers on the performance of software defect prediction remains unclear. The intent of this paper is to study the performance impact that different combinations of sampling techniques and machine learning classifiers have on defect prediction. Specifically, we investigate three sampling techniques, namely resampling, spread subsampling, and SMOTE (Synthetic Minority Over-sampling Technique), and five machine learning classifiers, namely C4.5, naive Bayes, logistic regression, support vector machine, and deep learning, to study their combined effect on defect prediction. Using the Friedman test and the Nemenyi test, we find that no single method is optimal among all the 12 combinations for defect prediction. However, support vector machine and deep learning produce the best performance stably across all the investigated projects. With ANOVA analysis, we find that the sampling techniques have a great impact on the outcomes of defect prediction because they produce different data distributions for model training. Moreover, the sampling proportion has a significant impact on TPR (True Positive Rate) and FPR (False Positive Rate), whereas it only influences the AUC (Area Under the Curve) and Balance of logistic regression. We explain the experimental results in the paper.
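
To illustrate the kind of sampling-plus-classifier combination the study evaluates, the sketch below pairs SMOTE (from imbalanced-learn) with a support vector machine (from scikit-learn) and computes the four metrics named above. This is a minimal, hypothetical example, not the paper's implementation: the `evaluate` function name, the 80/20 stratified split, and the default hyperparameters are assumptions made for illustration only.

```python
# Illustrative sketch: one sampling technique (SMOTE) combined with one
# classifier (SVM), evaluated with TPR, FPR, AUC, and Balance.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(X, y, random_state=0):
    # Hypothetical 80/20 stratified split; the paper's protocol may differ.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state)

    # Rebalance only the training data so the test distribution stays untouched.
    X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)

    clf = SVC(probability=True, random_state=random_state).fit(X_bal, y_bal)
    scores = clf.predict_proba(X_te)[:, 1]
    preds = clf.predict(X_te)

    tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
    tpr = tp / (tp + fn)               # True Positive Rate (defect-prone recall)
    fpr = fp / (fp + tn)               # False Positive Rate
    auc = roc_auc_score(y_te, scores)  # Area Under the ROC Curve
    # Balance: distance from the ideal point (FPR=0, TPR=1), rescaled to [0, 1].
    balance = 1 - np.sqrt(((0 - fpr) ** 2 + (1 - tpr) ** 2) / 2)
    return {"TPR": tpr, "FPR": fpr, "AUC": auc, "Balance": balance}
```

Swapping in a different sampler (e.g., random under- or over-sampling) or classifier yields the other combinations the study compares under the same evaluation scheme.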
