Abstract
The label quality of defect data sets has a direct influence on the reliability of defect prediction models. In this paper, we conduct a systematic study of inconsistent defect labels in multi-version-project defect data sets, i.e., instances that have the same source code but different labels across multiple versions of a software project. First, we illustrate the phenomenon of inconsistent labels with real examples and analyze its nature in the context of defect prediction. Then, we uncover the causes of inconsistent labels in representative label collection approaches. Finally, we investigate the actual influence of inconsistent labels on defect prediction models. We find that inconsistent labels are generally present in six multi-version-project defect data sets (either widely used or the most up-to-date in the literature) collected by diverse label collection approaches. In particular, inconsistent labels in a training data set significantly reduce the prediction performance of a model, while inconsistent labels in a test data set can considerably bias the evaluation of a model's real performance. Therefore, we recommend that, on the one hand, researchers leverage our findings to make targeted methodological improvements to existing defect label collection approaches so that fewer inconsistent labels are generated; on the other hand, practitioners should detect and exclude inconsistent labels in defect data sets to avoid their potential negative influence on defect prediction.
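To make the definition above concrete, the following is a minimal sketch of how inconsistent labels might be detected and excluded. It is not the procedure used in the paper. It assumes that each instance is a hypothetical tuple `(version, module_name, source_code, label)`, and that "the same source code" means the same module name with byte-identical source text; the abstract does not say how sameness is determined (for example, whether comments or whitespace are ignored). The function names are illustrative only.

```python
import hashlib
from collections import defaultdict


def _key(instance):
    """Identify an instance by module name and a hash of its source text.

    Assumption: identical text under the same module name counts as
    "the same source code"; the paper may use a different criterion.
    """
    _version, name, source, _label = instance
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return (name, digest)


def find_inconsistent_keys(instances):
    """Return the keys whose instances carry more than one distinct label."""
    labels_by_key = defaultdict(set)
    for instance in instances:
        labels_by_key[_key(instance)].add(instance[3])
    return {key for key, labels in labels_by_key.items() if len(labels) > 1}


def exclude_inconsistent(instances):
    """Drop every instance that takes part in a label inconsistency."""
    instances = list(instances)
    inconsistent = find_inconsistent_keys(instances)
    return [inst for inst in instances if _key(inst) not in inconsistent]


# Hypothetical toy data: Foo.java is unchanged between versions but
# labelled differently, so both of its instances are flagged.
records = [
    ("1.0", "Foo.java", "class Foo {}", "buggy"),
    ("1.1", "Foo.java", "class Foo {}", "clean"),
    ("1.0", "Bar.java", "class Bar {}", "clean"),
    ("1.1", "Bar.java", "class Bar {}", "clean"),
    ("1.1", "Baz.java", "class Baz { int x; }", "buggy"),
]
cleaned = exclude_inconsistent(records)
```

In this sketch all instances involved in an inconsistency are removed rather than relabelled, since the source text alone does not tell us which label is correct.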