Abstract

The growing number of software projects makes defect prediction increasingly important. Within-project defect prediction models can be reliable when adequate historical data are available; however, in the early phases of development, sufficient data do not exist to train an effective predictor. Cross-project defect prediction (CPDP) uses data from mature projects (source data) to predict whether the modules of a new project (target data) will be defective. CPDP models must account for the fact that the data distributions of the source and target projects differ, and they typically reduce this difference either by selecting suitable training data or by applying transfer learning. Although transfer learning effectively reduces distribution differences in recent CPDP models, none of them consider the negative transfer that may occur because of the imbalanced nature of defect data. In this paper, a four-step model is proposed: three steps prepare the training data and their initial weights, and the fourth step applies an enhanced version of the transfer boosting algorithm. This algorithm takes the imbalanced nature of the data into account and updates the weights of the source instances to improve prediction performance. Thus, in addition to reducing the distribution discrepancy between source and target data, the model also addresses the class imbalance of defect data. Compared with four state-of-the-art CPDP models on fifteen projects from the PROMISE, AEEEM, and SOFTLAB repositories, the proposed model produced consistent and accurate predictions, achieved the best average AUC and F-measure, and improved both by more than 5% on some datasets.
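To make the weight-update idea in the abstract concrete, the sketch below shows a TrAdaBoost-style transfer boosting loop in which source-instance weights are decayed when they conflict with the target concept, while a class-imbalance factor protects minority (defective) instances from being suppressed. The function name, the imbalance factor, and the exact update rule are illustrative assumptions, not the paper's four-step model or its precise algorithm.

```python
# Illustrative sketch of an imbalance-aware transfer boosting loop (assumption;
# not the paper's exact algorithm). Requires numpy and scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def imbalance_aware_transfer_boost(Xs, ys, Xt, yt, n_rounds=20):
    """Xs, ys: source-project instances/labels; Xt, yt: labelled target instances."""
    n_s, n_t = len(ys), len(yt)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt])
    w = np.ones(n_s + n_t) / (n_s + n_t)  # initial instance weights

    # Assumed imbalance handling: give the minority (defective) class a larger
    # multiplier so its weights are not driven down during boosting.
    class_factor = np.where(y == 1, (y == 0).sum() / max((y == 1).sum(), 1), 1.0)

    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))
    learners, alphas = [], []

    for _ in range(n_rounds):
        clf = DecisionTreeClassifier(max_depth=3)
        clf.fit(X, y, sample_weight=w / w.sum())
        miss = (clf.predict(X) != y).astype(float)

        # Weighted error measured on the target portion only, as in TrAdaBoost.
        eps_t = np.sum(w[n_s:] * miss[n_s:]) / np.sum(w[n_s:])
        eps_t = np.clip(eps_t, 1e-10, 0.499)
        beta_t = eps_t / (1.0 - eps_t)

        # Misclassified source instances are down-weighted; misclassified target
        # instances are up-weighted. The class factor softens the penalty for
        # minority-class source instances (illustrative assumption).
        w[:n_s] *= beta_src ** (miss[:n_s] / class_factor[:n_s])
        w[n_s:] *= beta_t ** (-miss[n_s:] * class_factor[n_s:])

        learners.append(clf)
        alphas.append(np.log(1.0 / beta_t))

    def predict(X_new):
        # Weighted majority vote over the boosted weak learners.
        votes = sum(a * (2 * l.predict(X_new) - 1) for a, l in zip(alphas, learners))
        return (votes > 0).astype(int)

    return predict
```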
