An Empirical Investigation of Combining Filter-Based Feature Subset Selection and Data Sampling for Software Defect Prediction

Kehan Gao, Taghi M. Khoshgoftaar, Amri Napolitano

doi:10.1142/s0218539315500278

Abstract

The main goal of software quality engineering is to produce a high-quality software product through the use of various techniques and processes. Classification models are effective tools for software quality prediction, helping practitioners detect potentially problematic modules and ultimately improve the software product. However, two potential problems, high dimensionality and class imbalance, may adversely affect classifier performance. In this study, we propose a data pre-processing approach in which feature selection is combined with data sampling to overcome these problems. We investigate two filter-based feature subset selection techniques, namely correlation-based and consistency-based subset evaluation, and three data sampling methods, namely random undersampling, random oversampling, and synthetic minority oversampling. We explore the effect of the feature selection techniques, the sampling methods, and their interactions on the performance of classification models. The empirical studies were carried out on 13 datasets from two real-world software systems. The results demonstrate that the correlation-based subset evaluation technique outperformed the consistency-based method when used with a random sampling method and when the training data had a high degree of class imbalance; however, when synthetic minority oversampling was employed or when the training dataset was less imbalanced, the consistency-based technique performed better than the correlation-based approach.
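To make the two random sampling methods named above concrete, the following is a minimal sketch in Python and is not the implementation used in the study. The function name `random_resample`, its parameters, and the 50:50 target class ratio are assumptions made for illustration; the abstract does not state the post-sampling ratio. The sketch also assumes the minority class (e.g., fault-prone modules) is the smaller one, and it omits the feature selection step and synthetic minority oversampling.

```python
import random


def random_resample(rows, labels, minority_label, mode, seed=0):
    """Balance a two-class dataset to an (assumed) 50:50 ratio.

    mode="under": randomly discard majority-class rows.
    mode="over":  randomly duplicate minority-class rows.
    Assumes the minority class has fewer rows than the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(labels) if lab == minority_label]
    majority = [i for i, lab in enumerate(labels) if lab != minority_label]

    if mode == "under":
        # Keep every minority row plus an equally sized random majority subset.
        keep = minority + rng.sample(majority, len(minority))
    elif mode == "over":
        # Keep every row and add random minority duplicates until classes match.
        n_extra = len(majority) - len(minority)
        keep = majority + minority + [rng.choice(minority) for _ in range(n_extra)]
    else:
        raise ValueError("mode must be 'under' or 'over'")

    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 2 fault-prone (label 1) and 8 not-fault-prone (label 0) modules.
rows = [[i] for i in range(10)]
labels = [1, 1] + [0] * 8
under_rows, under_labels = random_resample(rows, labels, 1, "under")
over_rows, over_labels = random_resample(rows, labels, 1, "over")
```

Undersampling shrinks the training set and may discard informative majority-class modules, while oversampling enlarges it with exact duplicates; in either case, sampling would be applied to the training data only.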

Full Text