An Empirical Study on the Impact of the Interaction between Feature Selection and Sampling in Defect Prediction

Shuyue Fan, Chun Liu, Zheng Li

doi:10.1109/dsa51864.2020.00025

Abstract

As software engineering grows, it becomes increasingly important to predict software defects in advance, which can reduce the major human and financial losses they cause. Software defect prediction faces two main problems: feature redundancy and class imbalance (non-defective modules far outnumber defective modules). Various feature selection methods and sampling methods exist to address these issues. However, how feature selection and sampling interact with each other has not been studied. We study the impact of this interaction on defect prediction in terms of five performance measures: AUC, accuracy, precision, recall and F1. Chi-square, IG (Information Gain) and relief are used for feature selection to remove redundant features; SMOTE (Synthetic Minority Over-sampling Technique) and RUS (random under-sampling) are used for sampling; and four classifiers are used: NB (Naive Bayes), LR (Logistic Regression), DT (Decision Tree) and SVM (Support Vector Machine). The results show that when chi-square or IG is used for feature selection, it is better to sample first, for both sampling methods, in terms of AUC and recall. When relief is used with the NB, LR or SVM classifiers, performing feature selection first is the better choice in terms of AUC and recall, whereas sampling first is better with the DT classifier.
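
The two orderings compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only one of the studied combinations (a chi-square style score with RUS), uses only the Python standard library, and every function name and the toy data set are hypothetical. The score assumes non-negative feature values and uses the common count-based formulation (observed versus expected per-class feature totals), which may differ from the variant used in the study.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """RUS: keep every minority row plus an equally sized random subset of each other class."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        if label != minority:
            idx = rng.sample(idx, counts[minority])
        keep.extend(idx)
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def chi2_scores(X, y):
    """Chi-square style score per feature (assumes non-negative feature values)."""
    n = len(y)
    class_sizes = Counter(y)
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for label, size in class_sizes.items():
            observed = sum(row[j] for row, lab in zip(X, y) if lab == label)
            expected = total * size / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k_features(X, y, k):
    """Return the column indices of the k highest-scoring features."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(X, cols):
    """Keep only the given columns of X."""
    return [[row[j] for j in cols] for row in X]


def select_then_sample(X, y, k, seed=0):
    """Order A: score features on the imbalanced data, then under-sample."""
    cols = top_k_features(X, y, k)
    Xs, ys = random_under_sample(project(X, cols), y, seed)
    return Xs, ys, cols


def sample_then_select(X, y, k, seed=0):
    """Order B: under-sample first, then score features on the balanced data."""
    Xs, ys = random_under_sample(X, y, seed)
    cols = top_k_features(Xs, ys, k)
    return project(Xs, cols), ys, cols


def make_toy_data(n=200, seed=1):
    """Hypothetical data: every 10th module is defective (label 1)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 10 == 0 else 0
        X.append([
            rng.randint(0, 5) + 3 * label,  # strongly related to the label
            rng.randint(0, 5),              # noise
            rng.randint(0, 5) + label,      # weakly related to the label
            rng.randint(0, 5),              # noise
        ])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    _, ya, cols_a = select_then_sample(X, y, k=2)
    _, yb, cols_b = sample_then_select(X, y, k=2)
    print("select first:", cols_a, Counter(ya))
    print("sample first:", cols_b, Counter(yb))
```

Both functions return a balanced training set with `k` columns. The only difference is which data the feature scores are computed on, and that difference is the interaction the study measures by training each classifier on the result of each ordering.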

Full Text