Abstract

Software defect prediction is an active topic in software engineering. Cross-project defect prediction (CPDP) uses the defect data set of a source project to predict defects in a target project. However, the metrics of the source project often differ from those of the target project, and traditional CPDP is of limited use in such cases. To address this inconsistency between source and target metrics, researchers have proposed heterogeneous cross-project defect prediction (HCPDP). To improve the performance of HCPDP, we propose a new method, Two-stage Cost-sensitive Local Models (TCLM). TCLM targets four problems in HCPDP: feature selection, linear inseparability of heterogeneous data, class imbalance, and data adoption. First, in the feature selection stage, we add cost information to improve the feature selection algorithm. KCCA (Kernel Canonical Correlation Analysis) is then used to project the heterogeneous data into a common feature space, which mitigates the inconsistency between the feature sets of the source and target projects. Second, in the model training stage, we adopt local models to improve performance and introduce cost information to deal with class imbalance. To verify the effectiveness of TCLM, we conduct a large-scale empirical study on 24 projects from the AEEEM, PROMISE, NASA, and ReLink datasets. Experimental results show that TCLM outperforms previous work. We therefore recommend using TCLM to build HCPDP models.
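
As a rough illustration of the model training stage described above, the following Python sketch combines local models with a cost-sensitive decision rule. It is not the TCLM implementation. It assumes the region centers are already given (in practice they might come from a clustering step), and it reduces each local model to the defect rate of its region. The names `fit_local_models`, `predict`, `cost_fn`, and `cost_fp`, and the 5:1 cost ratio, are hypothetical choices made for this example.

```python
from collections import defaultdict
from math import dist


def nearest(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def fit_local_models(X, y, centers, cost_fn=5.0, cost_fp=1.0):
    """Fit one toy model per local region and derive a cost-sensitive threshold.

    Assumption: missing a defective module (false negative) costs more than
    a false alarm (false positive); the 5:1 default is illustrative only.
    """
    labels_by_region = defaultdict(list)
    for xi, yi in zip(X, y):
        labels_by_region[nearest(xi, centers)].append(yi)

    # Each "local model" here is simply the defect rate of its region.
    rates = {k: sum(v) / len(v) for k, v in labels_by_region.items()}

    # Minimum expected cost: flag as defective when
    # p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    threshold = cost_fp / (cost_fp + cost_fn)
    return rates, threshold


def predict(x, centers, rates, threshold):
    """Predict 1 (defective) or 0 using the model of the nearest region."""
    region = nearest(x, centers)
    return int(rates.get(region, 0.0) > threshold)


# Toy data: two regions, with the defective class in the minority.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
y = [0, 0, 0, 1, 0, 0, 0, 0]
centers = [(0.5, 0.5), (10.5, 10.5)]

rates, threshold = fit_local_models(X, y, centers)
print(predict((0.2, 0.2), centers, rates, threshold))   # -> 1
print(predict((10, 10), centers, rates, threshold))     # -> 0
```

With equal costs (`cost_fn=1.0`, `cost_fp=1.0`) the threshold rises to 0.5 and the first point is no longer flagged, which shows how cost information shifts predictions toward the minority (defective) class.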
