Abstract

Software defect prediction (SDP) is an effective technique for lowering software module testing costs. However, imbalanced class distributions exist in almost all SDP datasets and limit the accuracy of defect prediction. To rebalance the data distribution reasonably, we propose LIMCR, a novel resampling method based on Naïve Bayes, to improve SDP performance. The main idea of LIMCR is to evaluate how informative each sample from the majority class is, and then remove the less-informative majority samples to rebalance the data distribution. We employ 29 SDP datasets from the PROMISE and NASA repositories and divide them into two groups: small datasets (fewer than 1100 samples) and large datasets (1100 samples or more). We then conduct experiments comparing combinations of classifiers and imbalance learning methods on the small and large datasets, respectively. The results show the effectiveness of LIMCR: LIMCR+GNB outperforms the other methods on small datasets, while its performance on large datasets is less competitive.
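
The exact informativeness measure is part of the method's detail; the following is a minimal sketch of the idea in Python, assuming the posterior confidence of a Gaussian Naïve Bayes model as the informativeness score (the function name limcr_resample, the keep_ratio parameter, and the scoring rule are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def limcr_resample(X, y, majority_label=0, keep_ratio=0.6):
    """Sketch of LIMCR-style undersampling: score majority-class samples
    with a Naive Bayes model and drop the less-informative ones.

    keep_ratio and the scoring rule are illustrative assumptions here,
    not the paper's exact criterion.
    """
    nb = GaussianNB().fit(X, y)
    maj_idx = np.where(y == majority_label)[0]
    # Treat majority samples the model classifies with high confidence
    # as "less informative" (far from the class boundary).
    maj_conf = nb.predict_proba(X[maj_idx])[:, nb.classes_ == majority_label].ravel()
    order = np.argsort(maj_conf)        # ascending: low confidence = near boundary
    n_keep = int(len(maj_idx) * keep_ratio)
    keep_maj = maj_idx[order[:n_keep]]  # keep the more informative majority samples
    keep = np.concatenate([keep_maj, np.where(y != majority_label)[0]])
    return X[keep], y[keep]
```

A typical use would be to call X_bal, y_bal = limcr_resample(X, y) before fitting a Gaussian Naïve Bayes classifier, mirroring the LIMCR+GNB pairing evaluated in the abstract.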

Highlights

  • Software defect prediction (SDP) is an effective technique to lower software module testing costs. It can efficiently identify defect-prone software modules by learning from defect datasets of previous releases

  • We present a novel resampling method LIMCR based on Naïve Bayes to solve the class imbalance problem in SDP datasets

  • The average balanced score and G-mean of LIMCR are 0.701 and 0.69, which outperform the other baseline imbalance learning methods (see the metric sketch below)
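
Both figures derive from per-class recall; a minimal sketch, assuming "balanced score" denotes balanced accuracy (the arithmetic mean of the two recalls), while G-mean is their geometric mean:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def balanced_score_and_gmean(y_true, y_pred):
    """Balanced accuracy = mean of per-class recalls;
    G-mean = geometric mean of per-class recalls (binary case)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall_pos = tp / (tp + fn)   # recall on the defect-prone class
    recall_neg = tn / (tn + fp)   # recall on the non-defect-prone class
    return (recall_pos + recall_neg) / 2, np.sqrt(recall_pos * recall_neg)
```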


Introduction

Software defect prediction (SDP) is an effective technique to lower software module testing costs. It can efficiently identify defect-prone software modules by learning from defect datasets of previous releases. Most prediction algorithms assume that the number of samples in each class is balanced. This contradiction means that prediction algorithms trained on imbalanced software defect datasets are generally biased towards samples in the non-defect-prone class and ignore samples in the defect-prone class; that is, many defect-prone samples may be classified as non-defect-prone by prediction algorithms trained on imbalanced datasets. This problem occurs widely in SDP, and it has been shown that reducing the influence of the imbalance problem can improve prediction performance efficiently.
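
To make this bias concrete, here is a minimal illustration on synthetic data (not one of the paper's 29 datasets): a plain classifier trained on a roughly 9:1 class split typically shows much lower recall on the minority (defect-prone) class than on the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

# Synthetic stand-in for an SDP dataset: ~10% defect-prone samples.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)

# Recall on the minority (defect-prone) class is usually much lower,
# illustrating the bias described above.
print("defect-prone recall:    ", recall_score(y_te, pred, pos_label=1))
print("non-defect-prone recall:", recall_score(y_te, pred, pos_label=0))
```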
