A clustering-based resampling technique with cluster structure analysis for software defect detection in imbalanced datasets

Leonidas Akritidis, Panayiotis Bozanis

doi:10.1016/j.ins.2024.120724

Abstract

Software defect detection focuses on the automatic identification of flaws in software modules. Given the importance of the problem, numerous researchers have introduced a rich collection of deep learning approaches to confront it. However, the datasets used to train the proposed classifiers are in most cases highly imbalanced, leading to models that are biased towards the majority class and cannot learn the minority classes effectively. The state-of-the-art solutions either overlook the issue of data imbalance or address it insufficiently, ignoring the existence of outliers and the local properties of the class distributions. In this work we introduce CBR, a Clustering-Based Resampling technique for mitigating the problem of class imbalance in software defect detection tasks. The proposed method first employs a simple heuristic to determine the maximum distance threshold between two clusters. Then, it uses this threshold to apply hierarchical clustering with the aim of grouping together similar samples. CBR treats the singleton clusters as outliers and discards the ones originating from the majority class. The algorithm subsequently organizes the clusters into sub-clusters that contain samples from the same class and determines which sub-clusters should participate in the oversampling process. In this way, CBR produces samples of improved quality and variance. We evaluated the performance of CBR against 9 baseline and state-of-the-art techniques by using 27 datasets and a Multilayer Perceptron classifier. The results demonstrate the superiority of CBR in terms of Balanced Accuracy and Precision scores.
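The steps outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the threshold heuristic, the linkage method, the rule for selecting sub-clusters, or how synthetic samples are generated. The sketch therefore assumes a threshold equal to a fraction of the mean pairwise distance, average linkage, selection of every minority sub-cluster with at least two samples, and SMOTE-style linear interpolation. The function name `cbr_sketch` and the parameters `minority`, `scale` and `seed` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cbr_sketch(X, y, minority=1, scale=0.5, seed=0):
    """Illustrative clustering-based resampling for a binary problem."""
    rng = np.random.default_rng(seed)

    # ASSUMPTION: the distance threshold is a fraction of the mean pairwise
    # distance; the heuristic used in the paper is not given in the abstract.
    threshold = scale * pdist(X).mean()

    # Hierarchical clustering cut at the distance threshold
    # (ASSUMPTION: average linkage).
    tree = linkage(X, method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")

    # Singleton clusters are treated as outliers; only the ones that come
    # from the majority class are discarded.
    ids, counts = np.unique(labels, return_counts=True)
    is_singleton = np.isin(labels, ids[counts == 1])
    keep = ~(is_singleton & (y != minority))
    X, y, labels = X[keep], y[keep], labels[keep]

    # Split each cluster into same-class sub-clusters and keep the minority
    # sub-clusters that can support interpolation
    # (ASSUMPTION: at least two samples).
    subclusters = [X[(labels == c) & (y == minority)] for c in np.unique(labels)]
    subclusters = [s for s in subclusters if len(s) >= 2]

    deficit = int((y != minority).sum() - (y == minority).sum())
    if deficit <= 0 or not subclusters:
        return X, y

    # ASSUMPTION: synthetic samples are linear interpolations between two
    # distinct members of the same minority sub-cluster.
    synthetic = []
    for _ in range(deficit):
        s = subclusters[rng.integers(len(subclusters))]
        i, j = rng.choice(len(s), size=2, replace=False)
        synthetic.append(s[i] + rng.random() * (s[j] - s[i]))

    X_res = np.vstack([X, np.array(synthetic)])
    y_res = np.concatenate([y, np.full(deficit, minority)])
    return X_res, y_res
```

Because each synthetic sample is generated inside a single sub-cluster, new points stay within regions already occupied by minority samples, which is one plausible reading of the abstract's emphasis on the local properties of the class distributions.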

Full Text