Unsupervised Learning to Heterogeneous Cross Software Projects Defect Prediction

Rohit Vashisht, Syed Afzal Murtaza Rizvi

doi:10.1007/978-981-16-2594-7_54

Abstract

Heterogeneous Cross-Project Defect Prediction (HCPDP) aims to predict defects in a target project that lacks sufficient historical defect data, using a defect prediction (DP) model trained on a different source project. It does not require the two projects to share the same set of metrics; instead, it builds the DP model from matched heterogeneous metrics whose values show analogous distributions for a given pair of datasets. This paper proposes a novel HCPDP model consisting of four phases: data preprocessing, feature engineering, metric matching, and training and testing. The DP model can be trained with either supervised or unsupervised learning techniques. Supervised learning uses labeled data, or well-defined instances, to train the model. Unsupervised learning, by contrast, trains the model by identifying hidden patterns in the distribution of unlabeled instances' values. The advantage of unlabeled data is that it is easier to obtain automatically than labeled data, which requires manual effort. This paper assesses, empirically and theoretically, the impact of the training process on the efficiency of the HCPDP model when an unsupervised learning method is used. In addition, a comparative study is carried out among HCPDP with supervised learning, HCPDP with unsupervised learning, and the standard DP approach, Within-Project Defect Prediction (WPDP). Logistic Regression and Km++ clustering are used as the supervised and unsupervised techniques, respectively. Results show that for both classes of DP, HCPDP and WPDP, the unsupervised learning method performs comparably to the supervised learning method.

Keywords: Cross project, Unsupervised learning, Heterogeneous, Software metric
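To illustrate the unsupervised idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that "Km++" refers to k-means clustering with k-means++ seeding. The synthetic two-metric data, the function names, and the rule that treats the cluster with the larger centroid values as defect-prone are all hypothetical choices made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two metric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """Choose k initial centroids using k-means++ seeding."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to d2.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


def kmeans(points, k, rng, iters=50):
    """Lloyd's algorithm started from k-means++ seeds."""
    centroids = kmeanspp_init(points, k, rng)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Hypothetical unlabeled modules, each described by two software metrics.
rng = random.Random(0)
low = [(rng.gauss(10, 2), rng.gauss(2, 0.5)) for _ in range(30)]
high = [(rng.gauss(40, 2), rng.gauss(8, 0.5)) for _ in range(30)]
modules = low + high

labels, centroids = kmeans(modules, 2, rng)

# Assumed heuristic: the cluster whose centroid has the larger metric
# values is flagged as defect-prone. No defect labels are used anywhere.
defect_cluster = max(range(2), key=lambda j: sum(centroids[j]))
predicted = [lab == defect_cluster for lab in labels]
print(sum(predicted), "of", len(modules), "modules flagged as defect-prone")
```

The point of the sketch is that no defect labels enter the training step: the grouping comes only from the distribution of metric values, and a separate rule (here, an assumed one) is needed to decide which cluster is reported as defective.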

Full Text