Abstract

Class imbalance in the output class distribution is a characteristic feature of real-world datasets. Most learning algorithms overlook the predictive contribution of the minority, or underrepresented, class. One way to address this challenge is to apply re-sampling techniques that correct the imbalance and give a more balanced output class distribution in the training examples. Random over-sampling duplicates minority class examples, whereas random under-sampling deletes training examples from the majority class; both aim to balance the class distribution of the dataset. The usefulness of these random sampling techniques has received attention in several research studies, particularly for two-class (binary) and multi-class classification problems. For many, the aim of applying them is an equal class distribution, which is expected to yield optimal model performance. This comparative assessment of random sampling optimization uses five classification algorithms, namely extreme gradient boosting, gradient boosting, random forest, support vector machines and logistic regression, to evaluate predictive performance under random sampling on a real-world healthcare dataset of patients suffering from hypertension with comorbidities. The average prediction accuracy scores (balanced accuracy) show statistically significant differences between the pre-sampling and post-sampling stages: the lowest score obtained post-sampling was 85.55%, as against 54% pre-sampling. However, the high AUC-ROC scores recorded in pre-sampling, over-sampling and under-sampling indicate a statistically insignificant impact of over-sampling and under-sampling in this context. This confirms that the impact of random sampling in predictive modeling is best explained in context.
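
The two random sampling techniques described above can be illustrated with a minimal sketch. It is not the study's implementation: the helper name `random_resample`, its parameters and the toy data are assumptions made for illustration, and it uses only the Python standard library rather than any specific re-sampling package.

```python
import random
from collections import Counter


def random_resample(X, y, strategy="over", seed=0):
    """Balance classes by random over- or under-sampling (hypothetical helper)."""
    if strategy not in ("over", "under"):
        raise ValueError("strategy must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    sizes = [len(idx) for idx in by_class.values()]
    # Over-sampling grows every class to the largest class size;
    # under-sampling shrinks every class to the smallest class size.
    target = max(sizes) if strategy == "over" else min(sizes)

    chosen = []
    for idx in by_class.values():
        if strategy == "over":
            # Keep every example, then duplicate random ones up to the target.
            extra = [rng.choice(idx) for _ in range(target - len(idx))]
            chosen.extend(idx + extra)
        else:
            # Keep a random subset of size `target` and discard the rest.
            chosen.extend(rng.sample(idx, target))

    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Toy data (assumed): 8 majority-class (0) and 2 minority-class (1) examples.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_resample(X, y, "over")
X_under, y_under = random_resample(X, y, "under")
print(Counter(y_over))   # 8 examples of each class
print(Counter(y_under))  # 2 examples of each class
```

In keeping with the abstract's reference to the training examples, a helper like this would normally be applied to the training split only, so that evaluation metrics such as balanced accuracy and AUC-ROC are computed on data with the original class distribution.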
