Abstract

Ranking-oriented cross-project defect prediction (ROCPDP), which ranks the software modules of a new target industrial project according to their predicted defect number or density, has been suggested in the literature. A major concern in ROCPDP is the distribution difference between the source project (a.k.a. within-project) data and target project (a.k.a. cross-project) data, which evidently degrades prediction performance. To investigate the impact of training data selection methods on the performance of ROCPDP models, we examined the practical effects of nine such methods, including a global filter that does not filter out any cross-project data. In addition, the prediction performances of ROCPDP models trained on cross-project data filtered by these selection methods were compared with those of ranking-oriented within-project defect prediction (ROWPDP) models trained on sufficient and on limited within-project data. Eleven available defect datasets from industrial projects were considered and evaluated using two ranking performance measures, FPA and Norm(Popt). The results showed no statistically significant differences among the nine training data selection methods in terms of FPA and Norm(Popt). The performances of ROCPDP models trained on filtered cross-project data were not comparable with those of ROWPDP models trained on sufficient historical within-project data. However, ROCPDP models trained on filtered cross-project data achieved better performance than ROWPDP models trained on limited historical within-project data. We therefore recommend that software quality teams exploit other projects' datasets to perform ROCPDP when little or no within-project data is available.
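
For orientation: FPA (fault-percentile-average) rewards rankings that place the most defective modules first. With modules sorted by predicted defect count, it averages, over every cut-off k, the proportion of actual defects captured by the top-k modules; Norm(Popt) is a normalized, area-based ranking measure derived from the Alberg diagram and is not sketched here. Below is a minimal sketch of FPA; the NumPy implementation and variable names are ours, not taken from the paper.

```python
import numpy as np

def fpa(actual_defects, predicted_defects):
    """Fault-percentile-average of a predicted ranking.

    Modules are ranked by predicted defect count (highest first);
    FPA is the mean, over all cut-offs k, of the fraction of actual
    defects found in the top-k modules. 1.0 is a perfect ranking.
    """
    actual = np.asarray(actual_defects, dtype=float)
    order = np.argsort(predicted_defects)[::-1]   # best-ranked module first
    ranked = actual[order]
    total = ranked.sum()
    if total == 0:
        return 1.0  # no actual defects: any ranking is trivially optimal
    top_k_fractions = np.cumsum(ranked) / total   # defect share in top-k
    return top_k_fractions.mean()

# Example: the module with the most actual defects is ranked first.
print(fpa(actual_defects=[5, 0, 2, 1], predicted_defects=[9.1, 0.2, 3.0, 1.5]))
```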

Highlights

  • Software defect prediction (SDP), also known as software fault prediction, is a research hotspot that has drawn considerable attention from both industry and academia [1,2]

  • The experimental results indicate that the performances of ranking-oriented cross-project defect prediction (ROCPDP) models trained on filtered cross-project (CP) data are not comparable with those of ranking-oriented within-project defect prediction (ROWPDP) models trained on sufficient historical within-project (WP) data in terms of FPA and Norm(Popt); however, ROCPDP models trained on filtered CP data achieved better results than ROWPDP models trained on limited historical WP data

  • We investigated whether ROCPDP models built with training data selection methods can achieve performances comparable with those of ROWPDP models trained on sufficient historical WP data; one representative selection method is sketched after this list
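
The nine training data selection methods examined in the paper are not enumerated on this page. As one widely used representative, the nearest-neighbour ("Burak") filter of Turhan et al. keeps, for each target-project module, its k nearest cross-project modules by distance over the metric vectors. A minimal sketch follows; the function name, k=10 default, and normalization choice are our illustrative assumptions.

```python
import numpy as np

def nn_filter(source_X, target_X, k=10):
    """Burak-style nearest-neighbour training data filter.

    For every module in the target project, keep its k nearest
    cross-project modules (Euclidean distance over metric vectors);
    duplicates are removed. Returns row indices into source_X.
    """
    src = np.asarray(source_X, dtype=float)
    tgt = np.asarray(target_X, dtype=float)
    # z-score using source statistics so both projects share one scale
    mean, std = src.mean(axis=0), src.std(axis=0) + 1e-12
    src_n, tgt_n = (src - mean) / std, (tgt - mean) / std
    selected = set()
    for t in tgt_n:
        dists = np.linalg.norm(src_n - t, axis=1)
        selected.update(np.argsort(dists)[:k].tolist())
    return sorted(selected)

# Toy usage: pick CP training rows that resemble the target project.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 6))   # 200 CP modules, 6 code metrics
target = rng.normal(size=(30, 6))    # 30 target-project modules
train_idx = nn_filter(source, target, k=10)
print(len(train_idx), "cross-project modules kept for training")
```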

Introduction

Software defect prediction (SDP), also known as software fault prediction, is a research hotspot that has drawn considerable attention from both industry and academia [1,2]. Defect prediction identifies the presence of defects in system or industrial software, supporting the discovery of their category, location, and scale [3,4,5,6,7]. It has long been recognized as one of the important means of improving the reliability of industrial system software [8,9,10]. The general approach of software defect prediction is to learn a classification model from historical datasets via machine learning algorithms and then predict whether new software modules contain bugs [11]. Accurate prediction results can support the allocation of testing resources by focusing effort on modules predicted to be defect-prone [12,13].
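
A minimal sketch of that standard workflow follows; the synthetic metric data and the random-forest learner are our illustrative assumptions, not the models studied in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative historical (within-project) data: one row per module,
# columns are static code metrics (e.g., size, complexity); labels mark
# whether the module turned out to be defective.
rng = np.random.default_rng(42)
X_hist = rng.normal(size=(500, 5))
y_hist = (X_hist[:, 0] + 0.5 * X_hist[:, 1] + rng.normal(size=500) > 1).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_hist, y_hist)                # learn from historical modules

X_new = rng.normal(size=(20, 5))         # metrics of new, unlabeled modules
defect_prone = model.predict(X_new)      # 1 = predicted defect-prone
risk = model.predict_proba(X_new)[:, 1]  # scores usable for ranking
print(f"{defect_prone.sum()} of 20 new modules flagged; top risk={risk.max():.2f}")
```

Ranking-oriented variants such as ROCPDP replace the binary classifier with a model that predicts defect number or density, and then rank modules by that prediction.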
