Abstract

A novel approach to software defect prediction on unlabeled datasets is proposed, based on a modified objective cluster analysis (OCA). The first step is to construct the distance matrix of the instances in a dataset by utilizing the clusters that the modified OCA determines automatically. The dipoles within different instances are then categorized into two groups. Finally, the clusters of instances are produced, and software defects are predicted by imposing a modified consistency criterion. A case study and comparative experiments were conducted on 12 public datasets selected from the Promise and ReLink databases, using multiple unsupervised algorithms and cross‐project approaches as baselines. There were two experimental settings: datasets containing all metrics, and datasets containing only module size metrics. The results were evaluated by precision, recall, F‐measure, and the area under the receiver operating characteristic curve (AUC). A complexity analysis of the tested algorithms was also conducted. On the datasets with all metrics, the proposed OCA achieved the best results on all four indexes, and the average values of precision, recall, F‐measure, and AUC improved by a minimum of 1.52%, 2.78%, 19.84%, and 0.93%, respectively. On the datasets with only module size metrics, the proposed OCA also achieved the best results on the four indexes, and the average values of precision, F‐measure, and AUC improved by a minimum of 8.8%, 2.59%, and 8.36%, respectively. The proposed algorithm has low complexity and provides a new way to efficiently predict software defects on unlabeled datasets.
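For readers unfamiliar with the four evaluation indexes named above, the following is a minimal Python sketch of how precision, recall, F‐measure, and AUC are commonly computed for a binary defect predictor. It is not taken from the paper: the function names `precision_recall_f` and `auc`, the label convention (1 = defective, 0 = clean), and the toy data are assumptions made purely for illustration, and the paper's exact evaluation procedure may differ.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall, and F-measure for binary labels (1 = defective).

    Hypothetical helper for illustration; not the paper's code.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen defective instance
    is scored higher than a randomly chosen clean one (ties count 0.5).

    Quadratic in the number of instances; fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented for this example only.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.1]
    print(precision_recall_f(y_true, y_pred))
    print(auc(y_true, scores))
```

In an unsupervised setting such as the one described, the true labels would be used only for this evaluation step, not for producing the clusters.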