Abstract

K-means clustering is widely used in many fields, including anomaly detection, customer segmentation, cyber-physical systems, medical diagnosis, sentiment analysis, and fraud detection. We apply k-means to imbalanced datasets, using stratified resampling to preserve the structure of the minority class. For this experimental study, we used a benchmark dataset from Kaggle: a labeled dataset on fake news collected from online social media. The proposed model, Stratified k-means Sampling (SKMS), is compared empirically with the Synthetic Minority Oversampling Technique (SMOTE) on the same dataset, using the same set of machine learning algorithms for both. The Random Forest (RF) algorithm achieves high accuracy, and Support Vector Classification (SVC) produces a better F1-score than the other algorithms. While SKMS seeks to preserve the structure of the minority class, SMOTE aims to diversify the minority class by interpolating between existing samples. Depending on the dataset, one approach may be more suitable than the other.
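
The contrast between the two resampling ideas can be illustrated with a minimal sketch. The abstract does not spell out the SKMS procedure, so the code assumes one plausible reading: cluster the minority class with k-means, treat the clusters as strata, and resample each stratum in proportion to its size. The SMOTE part is a simplified interpolation, not the full algorithm (library implementations such as `imblearn.over_sampling.SMOTE` handle more cases). All function names, parameters, and the toy data are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster empties
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

def stratified_cluster_sample(points, labels, n_samples, seed=0):
    """Assumed SKMS-style step: resample with replacement so that each
    cluster keeps its share of the minority class. Quotas are rounded,
    so the total can differ slightly from n_samples."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for p, l in zip(points, labels):
        groups[l].append(p)
    out = []
    for members in groups.values():
        quota = round(n_samples * len(members) / len(points))
        out.extend(rng.choices(members, k=quota))
    return out

def smote_like(points, n_samples, k_neighbors=2, seed=0):
    """Simplified SMOTE-style step: pick a point, pick one of its nearest
    neighbours, and create a new point on the segment between them."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        base = rng.choice(points)
        neighbors = sorted((p for p in points if p is not base),
                           key=lambda p: dist2(base, p))[:k_neighbors]
        other = rng.choice(neighbors)
        lam = rng.random()  # interpolation weight in [0, 1)
        out.append(tuple(b + lam * (o - b) for b, o in zip(base, other)))
    return out

# Toy minority class with two well-separated groups (made-up data).
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = kmeans(minority, k=2)
resampled = stratified_cluster_sample(minority, labels, n_samples=10)
synthetic = smote_like(minority, n_samples=4)
```

In this sketch the stratified step only repeats existing minority samples, so the relative size of each cluster is kept, whereas the interpolation step creates new points that did not exist in the data. That is the trade-off the abstract describes: structure preservation versus diversification.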
