Software Defect Prediction Model Sharing Under Differential Privacy

Dun Zhang, Xiang Chen, Zhanqi Cui, Xiaolin Ju

doi:10.1109/smartworld.2018.00266

Abstract

Most empirical studies in current software defect prediction (SDP) research use only data sets provided by the PROMISE repository, which poses a threat to external validity. Sharing SDP models instead of SDP data sets is a potential way to alleviate this problem and can encourage researchers to share more models. However, sharing models directly may disclose private information, for example through a model inversion attack. Differential privacy (DP) mechanisms can prevent this attack when the privacy budget is carefully selected. To the best of our knowledge, we are the first to apply DP to SDP model sharing, and we propose a novel method, A-DPRF. The method first preprocesses the data set, for example by over-sampling minority instances (i.e., faulty modules) and discretizing continuous features. It then uses a novel sampling strategy to create a set of training sets. Finally, it constructs decision trees based on these training sets, and these decision trees form a random forest (i.e., the model). The last two steps of A-DPRF use the Laplace and exponential mechanisms to satisfy the requirement of DP. In our empirical studies, we choose experimental subjects from real software projects, use AUC as the performance measure, and use holdout as the model validation technique. Our privacy and utility analysis shows that A-DPRF achieves better performance than a baseline method, B-DPRF, in most cases under the same privacy budget.
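The two DP building blocks named in the abstract can be illustrated generically. The sketch below is not the A-DPRF implementation: it shows the textbook Laplace mechanism (noise with scale sensitivity/epsilon added to a count) and the textbook exponential mechanism (a candidate chosen with probability proportional to exp(epsilon * score / (2 * sensitivity))). All function names, the feature names, the scores, the sensitivity of 1, and the idea of using them for a noisy leaf count and a split choice are assumptions made for illustration only.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: perturb a count with Laplace(sensitivity / epsilon) noise."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def exponential_weights(scores, epsilon, sensitivity=1.0):
    """Selection probabilities proportional to exp(epsilon * score / (2 * sensitivity))."""
    top = max(scores)  # shift by the maximum score for numerical stability
    raw = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(raw)
    return [w / total for w in raw]


def exponential_choice(candidates, scores, epsilon, rng, sensitivity=1.0):
    """Exponential mechanism: sample one candidate using the weights above."""
    probs = exponential_weights(scores, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical usage: pick a split feature privately, then release a noisy count.
rng = random.Random(0)
features = ["loc", "cyclomatic_complexity", "coupling"]  # assumed feature names
gains = [0.10, 0.40, 0.25]                               # assumed quality scores
split = exponential_choice(features, gains, epsilon=1.0, rng=rng)
leaf_count = noisy_count(12, epsilon=0.5, rng=rng)
print(split, leaf_count)
```

A smaller epsilon flattens the selection probabilities and widens the Laplace noise, which is the privacy and utility trade-off that the privacy budget controls.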

Full Text