Abstract
Class imbalance is a common problem in machine learning. It arises when one class contains significantly more samples than another, which can negatively affect the classification efficiency of algorithms. Under-sampling methods address class imbalance by reducing the number of samples in the majority class to obtain a balanced dataset. Traditional under-sampling methods based on k-means clustering either use a single fixed value of k (the number of clusters) for all datasets or set k directly from the number of samples in the minority or majority class. This paper proposes an adaptive k-means clustering under-sampling algorithm that calculates an appropriate k for each dataset. After clustering the majority class into k clusters, the algorithm calculates the distances between the data within each cluster and the cluster centroid from two perspectives and selects data based on these distances. The selected subset of the majority class is then combined with the minority class to generate a new balanced dataset, which is used as input to classification algorithms. The performance of the algorithm is evaluated on 45 datasets. Experimental results demonstrate that the algorithm can dynamically determine an appropriate k for different datasets and output a balanced dataset, thus enhancing the classification efficiency of machine learning algorithms. This work can provide new algorithmic ensemble strategies for addressing the class imbalance problem.
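The general pipeline described above (cluster the majority class, pick samples by their distance to the cluster centroids, merge with the minority class) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state how k is adapted or what the two distance perspectives are, so k is passed in as a fixed argument and the selection rule (keep the points nearest each centroid, with a per-cluster quota proportional to cluster size) is an assumption made only for illustration. The names kmeans and undersample are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of float tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct samples of the data.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels


def undersample(majority, minority, k, seed=0):
    """Return (X, y): a reduced majority class (label 0) plus the minority class (label 1).

    Assumption: k is supplied by the caller, and each cluster contributes
    its points closest to the centroid, in proportion to the cluster size.
    """
    centroids, labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        # Per-cluster quota so the kept total is roughly len(minority).
        quota = round(len(minority) * len(members) / len(majority))
        members.sort(key=lambda p: math.dist(p, centroids[c]))
        kept.extend(members[:quota])
    X = kept + list(minority)
    y = [0] * len(kept) + [1] * len(minority)
    return X, y
```

Because the per-cluster quotas are rounded, the two classes in the output may differ in size by a few samples in general. Selecting the points nearest each centroid favours representative samples; other rules, such as keeping points far from the centroid, are equally compatible with the description in the abstract.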