Conditional Domain Adversarial Adaptation for Heterogeneous Defect Prediction

Lina Gong,Shujuan Jiang,Li Jiang

doi:10.1109/access.2020.3017101

Abstract

Heterogeneous defect prediction (HDP) has become a very active research field in software engineering. It aims to identify as many defect-prone modules as possible in a target project using prediction models built on a source project with a heterogeneous metric set. Several HDP models with promising performance have been proposed. Most existing HDP models adopt unsupervised transfer learning to map the source project and the target project into the same feature space; this considers only the metric space and ignores the label information available from the source project and from a small part of the target project. Moreover, the predictive ability of these HDP models in the effort-aware context has not been compared. Therefore, we set out to investigate the effectiveness of label information for HDP, and to propose an HDP model that improves predictive performance in both the classification and effort-aware contexts. To use this label information, we propose a novel conditional domain adversarial adaptation (CDAA) approach, motivated by generative adversarial networks (GANs), to tackle the heterogeneity problem in software defect prediction (SDP). The CDAA architecture contains three networks: one generator, one discriminator and one classifier. The generator learns how to transfer the source instance space to the target instance space. The discriminator learns how to identify the fake instances produced by the generator. The classifier learns how to correctly classify the labels of instances. In CDAA, the losses of both the classifier and the discriminator are back-propagated to the generator. To ensure a fair comparison between state-of-the-art methods and CDAA, we use AUC, MCC and $P_{opt}$ as measures in an evaluation on 28 open-source projects. Experimental results demonstrate that the CDAA method can take advantage of label information to effectively map the source project to the target project and improve predictive performance. The results also demonstrate that CDAA is not affected by the number of metrics shared between the source project and the target project.
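The abstract states that the losses of both the classifier and the discriminator are back-propagated to the generator. The following is a minimal sketch of how such a combined objective could be computed, not the authors' implementation. It assumes binary cross-entropy for both terms, the common non-saturating form of the generator's adversarial loss, and a hypothetical weight `lam` between the two terms; none of these details are given in the abstract.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """Discriminator objective: score real target instances as 1 and
    generator outputs as 0. Inputs are lists of probabilities."""
    losses = [bce(p, 1) for p in d_real] + [bce(p, 0) for p in d_fake]
    return sum(losses) / len(losses)


def generator_loss(d_fake, c_fake, source_labels, lam=1.0):
    """Generator objective (assumed form): adversarial term plus label term.

    d_fake        -- discriminator probabilities for generated instances
    c_fake        -- classifier defect probabilities for generated instances
    source_labels -- true labels (0/1) of the source instances that were mapped
    lam           -- hypothetical weight on the classifier term
    """
    # Adversarial term: small when the discriminator is fooled (scores near 1).
    adv = sum(bce(p, 1) for p in d_fake) / len(d_fake)
    # Label term: small when mapped instances keep their source labels.
    cls = sum(bce(p, y) for p, y in zip(c_fake, source_labels)) / len(source_labels)
    return adv + lam * cls
```

In this sketch the generator is rewarded only when its mapped instances both look like target instances and remain correctly classifiable, which is one way to read the "conditional" use of label information described above.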

Highlights

Software defect prediction (SDP) aims to detect as many defective modules as possible in a software project by learning models trained on sufficient historical labeled instances [1], [2], [3], [4], [5], and it has attracted widespread attention from both industry and academia [6], [7]
We present the related work on SDP learning models in Section II, describe our proposed conditional domain adversarial adaptation (CDAA) method in Section III, and describe our experimental setup in Section IV. Experimental results and analysis are presented in Section V. Section VI describes the threats to the validity of our approach, and Section VII gives the conclusions and future directions
We observe that (i) the median values of the AUC, Matthews correlation coefficient (MCC) and $P_{opt}$ measures obtained by the CDAA method exceed those of the compared cross-project defect prediction (CPDP) methods VCB-SVM [12], TCBoost [13], double transfer boosting (DTB) [14] and MNB [15] across all five repositories; (ii) the median values of AUC, MCC and $P_{opt}$ obtained by our CDAA method exceed those of the unsupervised spectral clustering (SC) method across all five repositories, and those of the ManualDown (MD) [36] method across 3 of the 5 repositories; (iii) except on the SOFTLAB repository, the median values of AUC, MCC and $P_{opt}$ obtained by the random forest (RF) classifier (a WPDP method) exceed those of our CDAA and the other compared methods; (iv) the median values of AUC, MCC and $P_{opt}$ obtained by our CDAA method exceed those of the compared heterogeneous defect prediction (HDP) methods CCA+ (canonical correlation analysis) [17] and CT-KCCA [29]
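Two of the measures named above, MCC and AUC, have standard definitions that can be computed directly from labels and predictions. The sketch below uses only the Python standard library and follows the textbook formulas; the function names are illustrative, and the paper's own evaluation scripts are not available here. $P_{opt}$ is omitted because its definition depends on the effort model used.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 = clean, 1 = defective)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve: the probability that a random defective
    module is scored above a random clean one (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

MCC ranges from -1 to 1 and AUC from 0 to 1; an uninformative predictor scores about 0 on MCC and about 0.5 on AUC.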

Summary

Introduction

Software defect prediction (SDP) aims to detect as many defective modules as possible in a software project by learning models trained on sufficient historical labeled instances [1], [2], [3], [4], [5], and it has attracted widespread attention from both industry and academia [6], [7]. Many cross-project defect prediction (CPDP) approaches have been proposed to address the difference between source and target projects, such as VCB-SVM [12], TCBoost [13], DTB [14], MNB [15], and HYDRA [9]. These methods assumed that train and test projects

Methods

Results

Conclusion