Abstract
Heterogeneous defect prediction (HDP) aims to predict the defect tendency of modules in one project using heterogeneous data collected from other projects. Two characteristics of defect prediction data must be taken into account: (1) datasets can have different metrics and distributions, and (2) the data can be highly imbalanced. In this paper, we propose a few-shot learning based balanced distribution adaptation (FSLBDA) approach for heterogeneous defect prediction that takes both characteristics into consideration. Class imbalance in defect datasets can be addressed with undersampling, but undersampling shrinks the training dataset. Specifically, we first remove redundant metrics from the datasets with extreme gradient boosting. Then, we reduce the difference between the source domain and the target domain with balanced distribution adaptation, which considers both the marginal and the conditional distribution differences and adaptively assigns different weights to them. Finally, we use adaptive boosting to mitigate the effect of the smaller training dataset, which can improve the accuracy of the defect prediction model. We conduct experiments on 17 projects from 4 datasets using 3 indicators (i.e., AUC, G-mean, and F-measure). Compared with three classic approaches, the experimental results show that FSLBDA can effectively improve prediction performance.
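The trade-off noted in the abstract, that undersampling balances the classes at the cost of a smaller training set, can be illustrated with a minimal sketch. The function name `undersample` and the toy data are hypothetical and are not taken from the paper; the sketch uses plain random undersampling, which may differ from the sampling procedure the authors actually use.

```python
import random

def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    defective = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    # The shorter index list is the minority class.
    minority, majority = sorted([defective, clean], key=len)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 2 defective modules out of 10.
rows = [[float(i)] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = undersample(rows, labels)
print(len(y_bal), sum(y_bal))  # 4 rows remain, 2 of them defective
```

The balanced set here keeps only 4 of the 10 original modules, which is the kind of shrinkage the approach tries to compensate for with adaptive boosting.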
Highlights
With the availability of massive storage capabilities, high-speed Internet, and the advent of Internet of Things devices, modern software systems are growing in both size and complexity [1].
Wang et al. proposed a balanced distribution adaptation (BDA) method [20], which dynamically weighs the effects of the marginal distribution and the conditional distribution rather than giving them the same weight.
In this paper, we introduce BDA to dynamically narrow the marginal and conditional distribution differences between heterogeneous datasets using the balance factor.
Summary
With the availability of massive storage capabilities, high-speed Internet, and the advent of Internet of Things devices, modern software systems are growing in both size and complexity [1]. Subspace learning can reduce data drift during data mapping, but the source domain and the target domain still differ in their marginal and conditional distributions, which affects the decision result. When the source domain and target domain data differ greatly, marginal distribution adaptation is more important.
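The weighting of marginal against conditional adaptation can be sketched as a single balance factor `mu` in [0, 1] that blends the two distribution distances. The sketch below is illustrative only and rests on several assumptions that are not stated in the text above: both domains have already been mapped into a common feature space, each distance is approximated by the squared difference of mean vectors (a linear-kernel simplification), and the target labels are pseudo-labels from some earlier classifier. The names `bda_distance`, `mean_vector`, and `sq_dist` are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bda_distance(Xs, ys, Xt, yt_pseudo, mu):
    """Blend marginal and conditional distances with balance factor mu.

    mu near 0 emphasizes the marginal distance; mu near 1 emphasizes the
    conditional (per-class) distance.
    """
    marginal = sq_dist(mean_vector(Xs), mean_vector(Xt))
    conditional = 0.0
    # Only classes present in both domains contribute.
    for c in set(ys) & set(yt_pseudo):
        src = [x for x, y in zip(Xs, ys) if y == c]
        tgt = [x for x, y in zip(Xt, yt_pseudo) if y == c]
        conditional += sq_dist(mean_vector(src), mean_vector(tgt))
    return (1 - mu) * marginal + mu * conditional

# Hypothetical toy data in a shared two-dimensional feature space.
Xs, ys = [[0.0, 0.0], [2.0, 0.0]], [0, 1]
Xt, yt_pseudo = [[1.0, 1.0], [3.0, 1.0]], [0, 1]
d_marginal_only = bda_distance(Xs, ys, Xt, yt_pseudo, mu=0.0)
d_balanced = bda_distance(Xs, ys, Xt, yt_pseudo, mu=0.5)
d_conditional_only = bda_distance(Xs, ys, Xt, yt_pseudo, mu=1.0)
```

Under these assumptions, a small `mu` corresponds to the case described above in which the domains differ greatly and marginal adaptation matters more.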