Adaptive K-means clustering based under-sampling methods to solve the class imbalance problem

Qian Zhou, Bo Sun

doi:10.1016/j.dim.2023.100064

Abstract

Class imbalance is a common problem in machine learning. It arises when one class contains significantly more samples than another, which can negatively affect the classification efficiency of algorithms. Under-sampling methods address the problem by reducing the number of majority-class samples to obtain a balanced dataset. Traditional under-sampling methods based on k-means clustering either use a single fixed value of k (the number of clusters) for all datasets or derive it directly from the number of samples in the minority or majority class. This paper proposes an adaptive k-means clustering under-sampling algorithm that calculates an appropriate k for each dataset. After clustering the majority-class dataset into k clusters, our algorithm calculates the distances between the data within each cluster and the cluster centroids from two perspectives and selects data based on these distances. Subsequently, the selected subset of the majority-class dataset is combined with the minority-class dataset to generate a new balanced dataset, which is then passed to classification algorithms. The performance of our algorithm is evaluated on 45 datasets. Experimental results demonstrate that our algorithm can dynamically determine an appropriate k for different datasets and output a balanced dataset, thus enhancing the classification efficiency of machine learning algorithms. This work can provide new algorithmic ensemble strategies for addressing the class imbalance problem.
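
The general pipeline described in the abstract (cluster the majority class, select samples by their distance to the cluster centroids, merge with the minority class) can be sketched as follows. This is a minimal illustration and not the authors' algorithm: the abstract does not say how k is adapted or what the two distance perspectives are, so here k is simply passed in by the caller, each cluster contributes a quota proportional to its size, and the samples nearest to the centroid are kept. The function names `kmeans` and `undersample_majority` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means on a list of coordinate tuples.

    Returns (centroids, labels). Assumes k <= len(points).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, labels


def undersample_majority(majority, minority, k, seed=0):
    """Return (X, y) with roughly as many majority samples as minority ones.

    Assumption (not from the paper): k is supplied by the caller, each
    cluster's quota is proportional to its size, and the samples closest
    to the centroid are the ones kept.
    """
    centroids, labels = kmeans(majority, k, seed=seed)
    target = len(minority)
    selected = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        quota = round(target * len(members) / len(majority))
        members.sort(key=lambda p: math.dist(p, centroids[c]))
        selected.extend(members[:quota])
    X = selected + list(minority)
    y = [0] * len(selected) + [1] * len(minority)  # 0 = majority, 1 = minority
    return X, y
```

Because each quota is rounded independently, the number of retained majority samples can differ from the minority count by a small amount; an adaptive rule for choosing k, as proposed in the paper, would replace the fixed `k` argument.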

Full Text