Abstract

The aim of Just-In-Time (JIT) defect prediction is to predict, in a timely manner, which software changes in a project are prone to defects, thereby improving the efficiency of software development and ensuring software quality. Identifying the changes that introduce bugs is a critical task in JIT defect prediction, and researchers have introduced the SZZ approach and its variants to label such changes. However, it has been shown that the different SZZ algorithms introduce varying amounts of label noise into the dataset, which can degrade the predictive performance of the model. To address this limitation, we propose the Confident Learning Imbalance (CLI) model. The model identifies and excludes samples whose labels may be corrupted by estimating the joint distribution of noisy labels and true labels, thereby mitigating the impact of noisy data on the performance of the prediction model. The CLI model consists of two components: a Confident Learning (CL) component that identifies noisy data and an Imbalanced Data Probabilistic Prediction (IDPP) component that generates a predicted probability matrix for imbalanced data. The IDPP component generates precise predicted probabilities for each instance in the training set, while the CL component uses the resulting predicted probability matrix and the noisy labels to remove noisy samples and build the classification model. We evaluate our model through extensive experiments on a total of 126,526 changes from ten Apache open-source projects, and the results show that it outperforms the baseline methods.
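To make the confident-learning step concrete, the sketch below shows one common way such a step can be realized; it is an illustration under stated assumptions, not the paper's implementation. Out-of-fold predicted probabilities stand in for the output of the IDPP component, per-class confidence thresholds are estimated from them, and changes whose most confident class disagrees with their observed label are flagged as likely label noise. The feature matrix X, the noisy label vector y, the helper find_noisy_labels, and the random-forest probability estimator are names chosen for the example only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def find_noisy_labels(X, y, n_classes=2):
    """Flag training changes whose observed label is likely wrong.

    A simplified confident-learning pass: the paper's IDPP component would
    supply the predicted probability matrix; plain cross-validated
    probabilities are used here as a stand-in.
    """
    y = np.asarray(y)

    # Out-of-fold probability estimates P(class = j | x) for every change.
    pred_probs = cross_val_predict(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X, y, cv=5, method="predict_proba",
    )

    # Per-class threshold t_j: mean predicted probability of class j over
    # the examples that are labeled j.
    thresholds = np.array([pred_probs[y == j, j].mean() for j in range(n_classes)])

    # Confident joint C[i, j]: examples labeled i whose probability for
    # class j exceeds t_j; off-diagonal entries indicate suspected noise.
    confident_joint = np.zeros((n_classes, n_classes), dtype=int)
    suspect = np.zeros(len(y), dtype=bool)
    for idx, (label, probs) in enumerate(zip(y, pred_probs)):
        confident = np.where(probs >= thresholds)[0]
        if confident.size == 0:
            continue  # no class is confidently predicted for this change
        likely_true = confident[np.argmax(probs[confident])]
        confident_joint[label, likely_true] += 1
        suspect[idx] = likely_true != label

    return suspect, confident_joint

# Usage: drop the flagged changes before fitting the final JIT classifier,
# e.g. X_clean, y_clean = X[~suspect], y[~suspect].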
