Abstract

Heterogeneous defect prediction (HDP) aims to predict defect-prone software modules in one project using heterogeneous data collected from other projects. Defect data have two notable characteristics: data islands and data privacy. In this article, we propose a novel Federated Transfer Learning via Knowledge Distillation (FTLKD) approach for HDP that takes both characteristics into account. First, Shamir secret sharing provides homomorphic encryption for private data, so the data remain encrypted throughout all subsequent processing and operations. Second, each participant trains a convolutional neural network (CNN) on public data, transfers the parameters of the pre-trained CNN to a private model, and fine-tunes the private model with a small amount of labeled private data. Finally, knowledge distillation realizes communication between the participants: the average of all softmax outputs (logits) is used as the distillation target to update the private models. Extensive experiments on 9 projects from 3 public repositories (NASA, AEEEM, and SOFTLAB) show that FTLKD outperforms the related competing methods.
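The abstract's first component, Shamir secret sharing, can be illustrated with a minimal (t, n)-threshold sketch over a prime field. This is not the paper's implementation; the prime `P` and the function names `split` and `reconstruct` are illustrative assumptions. Any t of the n shares recover the secret via Lagrange interpolation at x = 0, while fewer than t shares reveal nothing.

```python
import random

P = 2**61 - 1  # a Mersenne prime used as the field modulus (illustrative choice)

def split(secret, n, t):
    """Split `secret` into n shares; any t of them reconstruct it."""
    # Random polynomial of degree t-1 with f(0) = secret.
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over GF(P)."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat's little theorem).
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total
```

For example, `split(secret, 5, 3)` yields five shares of which any three suffice; shamir-shared values can also be added share-wise, which is the additive-homomorphic property the abstract relies on.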

Highlights

  • With the availability of massive storage capabilities, high-speed Internet, and the advent of Internet of Things devices, modern software systems are growing in both size and complexity [1]

  • The framework of the proposed Federated Transfer Learning via Knowledge Distillation (FTLKD) approach, shown in FIGURE 1, is based on the idea of federated transfer learning

  • To solve the data scarcity, data island, and data privacy problems in heterogeneous defect prediction, this article proposes a heterogeneous defect prediction method based on federated transfer learning

Summary

INTRODUCTION

With the availability of massive storage capabilities, high-speed Internet, and the advent of Internet of Things devices, modern software systems are growing in both size and complexity [1]. HDP methods based on Bayesian learning [3], meta-learning [4], transfer learning [5], and so on can learn a high-quality defect prediction model from multiple source projects. Cheng et al. proposed SecureBoost, a security enhancement for vertical federated learning: a novel, lossless, privacy-preserving boosting tree framework that achieves the same degree of accuracy as non-privacy-preserving approaches [11]. From the public dataset and the private dataset with a small number of labels, the initial private model of each participant is constructed through a combination of transfer learning and fine-tuning. This solves the problem of insufficient labels and yields models that match the differing distributions of the private data. The private models are independent of one another: they share neither model hyperparameters nor the underlying data.
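The transfer-plus-fine-tuning step described above can be sketched with a deliberately simplified model: pre-train on abundant labeled "public" data, copy the learned parameters, then fine-tune briefly on a small labeled "private" set. A logistic-regression stand-in (rather than the paper's CNN), synthetic data, and the hyperparameters below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Clip to avoid overflow in exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_logreg(X, y, w=None, epochs=200, lr=0.5):
    """Gradient-descent logistic regression; `w` seeds the initial weights."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

# "Public" data: plenty of labels.
Xpub = rng.normal(size=(500, 4))
ypub = (Xpub[:, 0] + Xpub[:, 1] > 0).astype(float)

# "Private" data: same task under a slight shift, only a few labels.
Xpriv = rng.normal(size=(20, 4)) + 0.3
ypriv = (Xpriv[:, 0] + Xpriv[:, 1] > 0).astype(float)

w_pub = train_logreg(Xpub, ypub)                  # pre-train on public data
w_ft = train_logreg(Xpriv, ypriv, w=w_pub.copy(),
                    epochs=50, lr=0.1)            # fine-tune on private data

acc = float(((sigmoid(Xpriv @ w_ft) > 0.5) == ypriv).mean())
```

The key point is that fine-tuning starts from the transferred weights `w_pub` rather than from scratch, which is what lets a small labeled private set suffice.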

PROPOSED APPROACH
KNOWLEDGE DISTILLATION
Step 18: DLi is used to continue training the private models until convergence.
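The distillation-based communication round can be sketched as follows: each participant produces softened softmax outputs on a shared public batch, the outputs are averaged into a consensus target, and each private model is nudged toward that target. The participant count, random logits, temperature, and plain gradient update below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(z, T=2.0):
    """Temperature-softened softmax, row-wise."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)

# Hypothetical per-participant logits on a shared public batch (3 participants,
# 4 samples, 2 classes: defective / non-defective).
logits = [rng.normal(size=(4, 2)) for _ in range(3)]

# Average of all softened outputs: the shared "knowledge" exchanged in FTLKD.
soft_targets = np.mean([softmax(l) for l in logits], axis=0)

# One participant's distillation update: gradient descent on the cross-entropy
# between its softened output and the consensus, taken w.r.t. its logits.
student = logits[0].copy()
for _ in range(200):
    p = softmax(student)
    student -= 0.5 * (p - soft_targets)  # (p - q) is the CE gradient direction
```

Only the averaged outputs cross participant boundaries; raw private data and model parameters stay local, which is what makes distillation a privacy-compatible communication channel here.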
CONCLUSION