Abstract

Software Defect Prediction (SDP) is one of the most vital and cost-efficient activities for ensuring the quality of software under development. The performance of SDP heavily relies on the characteristics of the experimental datasets (hereafter, SDP datasets). However, SDP datasets often exhibit class overlap, i.e., defective and non-defective modules have similar metric values. Class overlap hinders both the performance and the practical use of SDP models. Although efforts have been made to investigate the impact of removing overlapping instances on SDP performance, many open issues remain challenging and unresolved. For example: 1) how can overlapping instances be identified effectively? 2) Is class overlap universal across SDP datasets? 3) What are the impacts of class overlap on the performance and interpretation of SDP models? Questions like these are important but have not been fully explored. In this paper, we conduct an empirical study to comprehensively investigate the impact of class overlap on SDP. Specifically, we first propose an approach that identifies overlapping instances by analyzing the class distribution in the local neighborhood of a given instance. Based on this approach, we then investigate the impact of class overlap on the performance and the interpretation of seven representative SDP models. Finally, we investigate the impact of two common overlapping instance handling techniques (i.e., removing and separating) on the performance of SDP models. Through an extensive case study on 230 datasets spanning industrial and open-source software projects, we observe that: i) 70.0% of SDP datasets contain overlapping instances; ii) different levels of class overlap have different impacts on the performance of SDP models, and the class overlap ratio and the number of instances seriously affect the stability of that performance; iii) class overlap affects the ranking of the important feature lists of SDP models, particularly the features at the top 2 and top 3 ranks; iv) class overlap handling techniques can statistically significantly improve the performance of SDP models trained on datasets with overlap ratios above 12.5%. On the basis of these findings, we suggest that future work in SDP apply our proposed KNN-based method to: i) check whether the overlap ratio of a defect dataset exceeds 12.5% before building SDP models; ii) remove overlapping instances to identify metrics with more consistent guiding significance; and iii) combine the random forest (RF) classifier with class overlap handling techniques when aiming to reduce code review effort.
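The abstract's identification idea, inspecting the class distribution in an instance's local neighborhood, can be sketched as a k-nearest-neighbor rule. This is a minimal illustration under assumptions: the neighborhood size `k`, the `threshold` on the opposite-class fraction, and the exact decision rule are hypothetical, since the abstract does not specify them.

```python
import numpy as np

def find_overlapping(X, y, k=5, threshold=0.5):
    """Flag instances as overlapping when at least `threshold` of their
    k nearest neighbors belong to the opposite class.

    `k` and `threshold` are illustrative defaults, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)
    # Pairwise Euclidean distances between all instances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # an instance is not its own neighbor
    overlapping = np.zeros(n, dtype=bool)
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]             # indices of k nearest neighbors
        opposite = np.mean(y[nbrs] != y[i])     # fraction with the other label
        overlapping[i] = opposite >= threshold
    return overlapping
```

For example, a lone defective module sitting inside a cluster of non-defective ones is flagged, because its whole neighborhood carries the opposite label; the overlap ratio of a dataset would then be `overlapping.mean()`.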


