Abstract

Imbalanced data in classification problems can affect the performance of classifiers. In predictive analytics, logistic regression is a statistical technique that is often used as a benchmark when other classifiers, such as Naive Bayes, decision trees, artificial neural networks and support vector machines, are applied to a classification problem. This study investigates the effect of the imbalance ratio (IR) in the response variable on the parameter estimates of binary logistic regression via a simulation study. Datasets were simulated with controlled imbalance ratios ranging from 1 % to 50 % and for various sample sizes. The simulated datasets were then modeled using binary logistic regression, and the error in the parameter estimates was measured using the mean squared error (MSE). The simulation results provide evidence that the imbalance ratio affects the parameter estimates: severe imbalance (IR = 1 %, 2 %, 5 %) yields higher MSE. The effect of high imbalance (IR ≤ 5 %) is more severe when the sample size is small (n = 100 and n = 500). Further investigation using real datasets from the UCI repository (Bupa Liver, n = 345, and Diabetes Messidor, n = 1151) confirmed the effect of the imbalance ratio on the parameter estimates and the odds ratios, which can lead to misleading results.
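
A simulation of this kind can be sketched in a few lines of Python. The sketch below is illustrative only and is not the study's actual design: it assumes that IR means the proportion of events (y = 1), that there is a single standard normal covariate with true slope 1, and that the intercept is tuned by bisection to reach the target IR. The function names, the number of replications and the hand-written Newton-Raphson fit are all hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(eta):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))


def intercept_for_ratio(target, beta1, rng, n_mc=50_000):
    """Bisection for the intercept giving an average event probability
    equal to `target` (assumed way of controlling the imbalance ratio)."""
    x = rng.standard_normal(n_mc)
    lo, hi = -20.0, 20.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if np.mean(sigmoid(mid + beta1 * x)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fit_logistic(X, y, n_iter=50, tol=1e-8):
    """Maximum-likelihood fit by Newton-Raphson; X must contain an
    intercept column. No safeguard against separation is included."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = (X * w[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def simulate_mse(ir, n, reps=200, beta1=1.0, seed=0):
    """Monte Carlo MSE of the slope estimate for a given IR and sample size."""
    rng = np.random.default_rng(seed)
    beta0 = intercept_for_ratio(ir, beta1, rng)
    sq_errors = []
    for _ in range(reps):
        x = rng.standard_normal(n)
        y = (rng.random(n) < sigmoid(beta0 + beta1 * x)).astype(float)
        X = np.column_stack([np.ones(n), x])
        estimate = fit_logistic(X, y)
        sq_errors.append((estimate[1] - beta1) ** 2)
    return float(np.mean(sq_errors))


if __name__ == "__main__":
    for ir in (0.05, 0.20, 0.50):
        print(f"IR = {ir:.0%}, n = 1000: MSE(slope) = {simulate_mse(ir, 1000):.4f}")
```

The example uses n = 1000 on purpose. At very small samples and severe imbalance (for instance n = 100 with IR = 1 %), some replications contain almost no events, the maximum-likelihood estimate may not exist because of separation, and this plain Newton-Raphson fit would need an extra safeguard.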

