ConCPDP: A Cross‐Project Defect Prediction Method Integrating Contrastive Pretraining and Category Boundary Adjustment

Hengjie Song,Yufei Pan,Feng Guo,Xue Zhang,Le Ma,Siyu Jiang

doi:10.1049/2024/5102699

Hengjie Song, Yufei Pan + Show 4 more

https://doi.org/10.1049/2024/5102699

Copy DOI

Export

Save

Cite

Journal: IET Software	Publication Date: Jan 1, 2024
License type: CC BY 4.0

Abstract
Full-Text
Similar Papers

Abstract

Listen

Software defect prediction (SDP) is a crucial phase preceding the launch of software products. Cross‐project defect prediction (CPDP) is introduced for the anticipation of defects in novel projects lacking defect labels. CPDP can use defect information of mature projects to speed up defect prediction for new projects. So that developers can quickly get the defect information of the new project, so that they can test the software project pertinently. At present, the predominant approaches in CPDP rely on deep learning, and the performance of the ultimate model is notably affected by the quality of the training dataset. However, the dataset of CPDP not only has few samples but also has almost no label information in new projects, which makes the general deep‐learning‐based CPDP model not ideal. In addition, most of the current CPDP models do not fully consider the enrichment of classification boundary samples after cross‐domain, leading to suboptimal predictive capabilities of the model. To overcome these obstacles, we present contrastive learning pretraining for CPDP (ConCPDP), a CPDP method integrating contrastive pretraining and category boundary adjustment. We first perform data augmentation on the source and target domain code files and then extract the enhanced data as an abstract syntax tree (AST). The AST is then transformed into an integer sequence using specific mapping rules, serving as input for the subsequent neural network. A neural network based on bidirectional long short‐term memory (Bi‐LSTM) will receive an integer sequence and output a feature vector. Then, the feature vectors are input into the contrastive module to optimise the feature extraction network. The pretrained feature extractor can be fine‐tuned by the maximum mean discrepancy (MMD) between the feature distribution of the source domain and the target domain and the binary classification loss on the source domain. This paper conducts a large number of experiments on the PROMISE dataset, which is commonly used for CPDP, to validate ConCPDP’s efficacy, achieving superior results in terms of F1 measure, area under curve (AUC), and Matthew’s correlation coefficient (MCC).

Full Text