Evaluation of Sampling-Based Ensembles of Classifiers on Imbalanced Data for Software Defect Prediction Problems

Thanh Tung Khuat,My Hanh Le

doi:10.1007/s42979-020-0119-4

Abstract

Defect prediction in software projects plays a crucial role to reduce quality-based risk and increase the capability of detecting faulty program modules. Hence, classification approaches to anticipate software defect proneness based on static code characteristics have become a hot topic with a great deal of attention in recent years. While several novel studies show that the use of a single classifier causes the performance bottleneck, ensembles of classifiers might effectively enhance classification performance compared to a single classifier. However, the class imbalance property of software defect data severely hinders the classification efficiency of ensemble learning. To cope with this problem, resampling methods are usually combined into ensemble models. This paper empirically assesses the importance of sampling with regard to ensembles of various classifiers on imbalanced data in software defect prediction problems. Extensive experiments with the combination of seven different kinds of classification algorithms, three sampling methods, and two balanced data learning schemata were conducted over ten datasets. Empirical results indicated the positive effects of combining sampling techniques and the ensemble learning model on the performance of defect prediction regarding datasets with imbalanced class distributions.

Full Text