Cross-Project Software Defect Prediction Based on Class Code Similarity

Wanzhi Wen, Haoren Wang, Ruinian Zhang, Xiaohong Lu, Zhixian Li, Ningbo Zhu, Chenqiang Shen

doi:10.1109/access.2022.3211401

Abstract

Software defect prediction techniques help developers find software defects as early as possible and reduce the cost of software development. Cross-project defect prediction usually uses an entire source project to predict defects in the target project. However, the data distributions of the entire source project and the target project generally differ widely, so prediction accuracy is low. We propose CCS-CPDP, a cross-project software defect prediction technique based on class code similarity. First, the technique converts the code set extracted via the AST (Abstract Syntax Tree) into a vector set using the DTI (Doc2Bow and TF-IDF) strategy. Second, it calculates the similarity between the vector set of the target project and that of the training project. Finally, following the majority-decision principle of KNN, it determines the number of most similar class instances in the training project and refines the source project by selecting those class instances, which are then used for software defect prediction. The method is compared with four traditional classification models (KNN, Random Forest, Naive Bayes, and Logistic Regression) for defect prediction. Experimental results show that, compared with the baseline, CCS-CPDP increases recall and F1 by 18.03% and 14.1%, respectively. In addition, compared with the current source project selection technology, the refined source projects selected by CCS-CPDP improve recall and F1 by 37.6% and 12.7%, respectively.
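The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each class is already reduced to a list of AST node-type tokens, that similarity is cosine similarity (the abstract does not name the measure), that IDF uses a smoothed logarithmic form, and that the refined source set is the union of the k most similar source classes for each target class. All function names and the parameter k are hypothetical.

```python
import math
from collections import Counter


def doc2bow(tokens, vocab):
    """Map a token list to a sparse {token_id: count} dict, growing vocab as needed."""
    counts = Counter(tokens)
    return {vocab.setdefault(tok, len(vocab)): n for tok, n in counts.items()}


def tfidf(bows):
    """Weight raw counts by a smoothed IDF (assumed form, not taken from the paper)."""
    n_docs = len(bows)
    df = Counter(tid for bow in bows for tid in bow)
    idf = {tid: math.log((1 + n_docs) / (1 + d)) + 1.0 for tid, d in df.items()}
    return [{tid: cnt * idf[tid] for tid, cnt in bow.items()} for bow in bows]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tid, 0.0) for tid, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_similar_instances(source_tokens, target_tokens, k):
    """Return sorted indices of source classes among the k nearest to any target class."""
    vocab = {}
    bows = [doc2bow(toks, vocab) for toks in source_tokens + target_tokens]
    vecs = tfidf(bows)
    src, tgt = vecs[:len(source_tokens)], vecs[len(source_tokens):]
    selected = set()
    for t in tgt:
        ranked = sorted(range(len(src)), key=lambda i: cosine(src[i], t), reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)


# Toy AST token sequences (hypothetical data, for illustration only).
source = [
    ["ClassDef", "MethodDecl", "If", "Return"],
    ["ClassDef", "For", "MethodInvocation"],
    ["Try", "Catch", "Throw"],
]
target = [["ClassDef", "MethodDecl", "If", "If", "Return"]]
print(select_similar_instances(source, target, k=1))
```

The selected indices would then pick the rows of the source project's defect dataset on which a classifier such as KNN or Random Forest is trained. In this sketch k is supplied by hand, whereas the abstract describes determining the number of similar instances through KNN-style majority decision.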

Full Text