Abstract

Cross-project defect prediction (CPDP), where data from different software projects are used to predict defects, has been proposed as a way to provide data for software projects that lack historical data. Evaluations of CPDP models using the Nearest Neighbour (NN) Filter approach have shown promising results in recent studies. A key challenge with defect-prediction datasets is class imbalance, that is, highly skewed datasets in which non-buggy modules dominate the buggy modules. In the past, data resampling approaches have been applied to within-project defect prediction models to help alleviate the negative effects of class imbalance in the datasets. To address the class imbalance issue in CPDP, the authors assess the impact of data resampling approaches on CPDP models after the NN Filter is applied. The impact on prediction performance of five oversampling approaches (MAHAKIL, SMOTE, Borderline-SMOTE, Random Oversampling and ADASYN) and three undersampling approaches (Random Undersampling, Tomek Links and One-sided Selection) is investigated, and the results are compared to approaches without data resampling. The authors examined six defect prediction models on 34 datasets extracted from the PROMISE repository. The results show that data resampling has a significant positive effect on CPDP performance, suggesting that software quality teams and researchers should consider applying data resampling approaches when the goal is improved recall (pd) and g-measure prediction performance. However, if the goal is to improve precision and reduce false alarms (pf), then data resampling approaches should be avoided.
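As an illustration only, and not the authors' exact pipeline, the sketch below shows how a resampling step of the kind studied here can be applied to NN-filtered training data before fitting a classifier. It uses the imbalanced-learn and scikit-learn libraries; the function name, sampler parameters and the logistic-regression classifier are assumptions made for the example.

    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.linear_model import LogisticRegression

    def train_cpdp_model(X_filtered, y_filtered, strategy="smote"):
        # Resample the NN-filtered source data, then fit a defect predictor.
        if strategy == "smote":
            sampler = SMOTE(random_state=0)               # oversample the buggy (minority) class
        else:
            sampler = RandomUnderSampler(random_state=0)  # discard non-buggy (majority) instances
        X_res, y_res = sampler.fit_resample(X_filtered, y_filtered)
        return LogisticRegression(max_iter=1000).fit(X_res, y_res)

The other oversampling and undersampling approaches named in the abstract (except MAHAKIL) are available in imbalanced-learn under similar interfaces and could be swapped in for the sampler above.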

Highlights

  • Defect prediction models can help to identify defective software components and thereby support managers in resource allocation

  • RQ1: What is the impact of data resampling approaches on Nearest Neighbour (NN)‐filtered datasets in cross‐project defect prediction (CPDP)?

  • We first evaluate the performance of using an NN filter for CPDP, and we then investigate the influence of data resampling approaches on the performance of CPDP models after filtering the training datasets


Introduction

Defect prediction models can help to identify defective software components and thereby support managers in resource allocation. A promising way to build such models for projects that lack historical data is cross-company or cross-project defect prediction (CPDP), where data from other companies or projects are used for model training. The approach proposed by Turhan et al. [11] is a relevancy filter that selects, from a collection of projects, the data instances closest to the new target project using the k-NN algorithm. It is a pre-processing step that can be combined with the normal classification process: for each module in the target project, the k nearest neighbours are found in the combined training data according to their pairwise Euclidean distances, and the selected instances form the filtered training set.
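For illustration only (not code from the paper), the following is a minimal sketch of this type of NN filter using scikit-learn, assuming the source and target data are numeric feature matrices (numpy arrays); the function name and the default k = 10 are assumptions made for the example.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def nn_filter(source_X, source_y, target_X, k=10):
        # For each target module, find its k nearest source modules by
        # Euclidean distance, then keep the union of all selected instances.
        nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(source_X)
        _, idx = nn.kneighbors(target_X)   # indices of the k neighbours per target row
        keep = np.unique(idx.ravel())      # union of selected instances, duplicates removed
        return source_X[keep], source_y[keep]

The filtered source data returned here would then be passed to a resampling approach and a classifier, as in the sketch after the abstract.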
