Abstract
In software defect prediction, a common and significant problem is data imbalance: non-defect-prone modules greatly outnumber defect-prone ones. This imbalance biases most typical classifiers, such as logistic regression, SVM, decision trees, and boosting, toward the majority class of non-defect-prone modules. In most cases, however, we are more interested in the minority class, the defect-prone modules, since the goal is to detect as many of them as possible. To improve the identification of the minority class, we propose an adaptive weight-updating scheme based on AdaBoost. We first employ SMOTE, or any other synthetic sample generation method, to balance the training dataset. Then, each synthetic sample is assigned a penalty factor adaptively according to its local density. This penalty factor is introduced into the cost function to adjust the sample weights, so that the base classifiers are adaptively guided to learn from reliable synthetic samples rather than noisy ones. The result is a more reliable classifier with higher accuracy on the minority class. A series of experiments on MDP, a collection of NASA software defect datasets, demonstrates the effectiveness of our method.
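The abstract does not give the exact formulas, but the core idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a simplified SMOTE-style interpolation for generating synthetic minority samples, and a hypothetical density-based penalty factor defined as the fraction of minority-class points among a synthetic sample's k nearest real neighbours, which is then used to down-weight unreliable synthetic samples in the initial AdaBoost weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy imbalanced data: majority cluster near (0,0), minority cluster near (5,5)
X_maj = rng.normal(0.0, 0.5, size=(50, 2))
X_min = rng.normal(5.0, 0.5, size=(10, 2))

def smote_like(X_minority, n_new):
    # simplified SMOTE: interpolate between a random minority sample
    # and another random minority sample
    idx = rng.integers(0, len(X_minority), size=n_new)
    nbr = rng.integers(0, len(X_minority), size=n_new)
    gap = rng.random((n_new, 1))
    return X_minority[idx] + gap * (X_minority[nbr] - X_minority[idx])

def penalty_factor(x, X_majority, X_minority, k=5):
    # hypothetical density-based penalty: fraction of the k nearest real
    # samples that belong to the minority class. A synthetic point deep in
    # majority territory (likely noise) gets a factor near 0; a point inside
    # the minority region (likely reliable) gets a factor near 1.
    X_all = np.vstack([X_majority, X_minority])
    y_all = np.r_[np.zeros(len(X_majority)), np.ones(len(X_minority))]
    dists = np.linalg.norm(X_all - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_all[nearest].mean()

# generate synthetic minority samples and score their reliability
X_syn = smote_like(X_min, n_new=40)
pf = np.array([penalty_factor(x, X_maj, X_min) for x in X_syn])

# initial AdaBoost sample weights: real samples uniform, synthetic samples
# scaled by their penalty factor, then normalised to sum to 1
w = np.r_[np.ones(len(X_maj) + len(X_min)), pf]
w /= w.sum()
```

In a full AdaBoost loop, the same penalty factor would also temper the exponential weight increase of misclassified synthetic samples, so that noisy synthetic points cannot dominate later boosting rounds.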