Abstract

Software fault prediction is an important research topic in software quality assurance. The performance of a fault prediction model depends on the features used to train it, and redundant or irrelevant features can hinder the performance of a classification model. In this paper, we present an empirical study of a two-stage data pre-processing technique for software fault prediction models. In the first stage, we propose a novel semi-supervised deep Fuzzy C-Means (DFCM) clustering-based feature extraction technique that creates new features from deep multi-clusters of unlabeled and labeled data sets, using FCM clustering to maximize intra-cluster class and intra-cluster feature similarity. FCM is also used to handle the class imbalance problem. In the second stage, we further improve prediction performance by combining this with feature selection (using random under-sampling) to reduce noisy data for classification. The results indicate that prediction models combined with the novel DFCM data pre-processing approach perform better, owing to its ability to identify and combine essential information in the data features. The empirical study uses real-world software project data sets (NASA and Eclipse) to evaluate DFCM by applying different data pre-processing schemes to prediction models (C4.5, naive Bayes, and 1-nearest neighbor (1-NN)) that are widely used in software fault prediction, and it further investigates the factors that influence our approach. The results show that the proposed DFCM feature extraction technique for data pre-processing is stable and effective on all prediction models.
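
To make the clustering-based feature extraction idea concrete, the following is a minimal sketch of plain, single-level, unsupervised Fuzzy C-Means used to derive extra features. It is not the DFCM method of this paper: the deep multi-cluster structure, the semi-supervised use of labels, and the handling of class imbalance are all omitted. The function names (`fcm_memberships`, `fuzzy_c_means`, `extract_features`), the fuzzifier `m=2.0`, and the choice to append membership degrees to the original metrics are assumptions made only for illustration.

```python
import numpy as np


def fcm_memberships(X, centers, m=2.0):
    """Standard FCM membership update: u_ik is proportional to d_ik^(-2/(m-1))."""
    # Euclidean distance from every sample to every cluster center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # guard against division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Alternate center and membership updates until memberships stabilize."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Each center is the membership-weighted mean of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        U_new = fcm_memberships(X, centers, m)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def extract_features(X, centers, m=2.0):
    """Assumed scheme: append membership degrees to the original features."""
    return np.hstack([X, fcm_memberships(X, centers, m)])
```

In such a setup, the centers would be fitted on the pooled labeled and unlabeled modules, and `extract_features` would then be applied to the training and test sets before fitting a classifier such as naive Bayes or 1-NN.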
