Comparison of Error Rate Prediction Methods in Binary Logistic Regression Modeling for Imbalanced Data

Bahri Annur Sinaga, Dodi Vionanda, Admi Salma, Dony Permana

doi:10.24036/ujsds/vol1-iss4/86

Open Access

Journal: UNP Journal of Statistics and Data Science
Publication Date: Aug 28, 2023
License type: CC BY 4.0

Abstract

Binary logistic regression is a regression method used for classification modeling. Its performance can be judged by the accuracy of the fitted model, and accuracy can in turn be assessed by predicting the error rate. A commonly used method for predicting the error rate is cross-validation, which has three algorithms: leave-one-out, hold-out, and k-fold. Leave-one-out splits the data according to the number of observations, so that each observation takes a turn as testing data, but the analysis takes a long time when the number of observations is large. Hold-out is the simplest algorithm: it randomly divides the data into only two parts, so important observations may not end up in the training data. K-fold divides the data into several groups, but it is not suitable for data with a small number of observations. In practice, real data are often imbalanced. In logistic regression, as the data become more imbalanced, the predicted error rate approaches the proportion of the minority class. This research compares error rate prediction methods in binary logistic regression modeling with imbalanced data. The study uses three types of data, namely univariate, bivariate, and multivariate, generated under different settings of the population mean difference and the correlation between independent variables. The results show that k-fold is the most suitable error rate prediction algorithm for binary logistic regression.
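The three cross-validation algorithms described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The simulation settings (200 univariate observations, a 10% minority class, unit variance, a mean difference of 1, a 70/30 hold-out split, and k = 10) are assumptions chosen only for illustration. The function names and the small ridge term in the Newton-Raphson fit are likewise hypothetical conveniences of this sketch.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-4):
    """Fit binary logistic regression by Newton-Raphson; returns [intercept, slopes...]."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        eta = np.clip(Xb @ beta, -30.0, 30.0)  # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        # Tiny ridge term (an assumption of this sketch) keeps the Hessian invertible.
        hessian = Xb.T @ (Xb * w[:, None]) + ridge * np.eye(Xb.shape[1])
        gradient = Xb.T @ (y - p) - ridge * beta
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def error_rate(X_tr, y_tr, X_te, y_te):
    """Train on one part, return the misclassification rate on the other."""
    beta = fit_logistic(X_tr, y_tr)
    Xb_te = np.column_stack([np.ones(len(X_te)), X_te])
    pred = (Xb_te @ beta > 0).astype(int)  # same as predicted probability > 0.5
    return float(np.mean(pred != y_te))


def holdout_error(X, y, rng, test_frac=0.3):
    """Hold-out: a single random split into training and testing parts."""
    idx = rng.permutation(len(y))
    n_te = int(round(test_frac * len(y)))
    te, tr = idx[:n_te], idx[n_te:]
    return error_rate(X[tr], y[tr], X[te], y[te])


def kfold_error(X, y, rng, k=10):
    """K-fold: each group is the testing part once; k = len(y) gives leave-one-out."""
    folds = np.array_split(rng.permutation(len(y)), k)
    wrong = 0.0
    for i, te in enumerate(folds):
        tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
        wrong += error_rate(X[tr], y[tr], X[te], y[te]) * len(te)
    return wrong / len(y)


def simulate(n, minority_frac, mean_diff, rng):
    """Univariate imbalanced data: class 1 is the minority, shifted by mean_diff."""
    n_min = int(round(n * minority_frac))
    y = np.concatenate([np.ones(n_min, dtype=int), np.zeros(n - n_min, dtype=int)])
    X = rng.normal(loc=mean_diff * y, scale=1.0)[:, None]
    return X, y


rng = np.random.default_rng(0)
X, y = simulate(n=200, minority_frac=0.10, mean_diff=1.0, rng=rng)
estimates = {
    "hold-out": holdout_error(X, y, rng),
    "10-fold": kfold_error(X, y, rng, k=10),
    "leave-one-out": kfold_error(X, y, rng, k=len(y)),
}
for name, err in estimates.items():
    print(f"{name:>14}: {err:.3f}")
```

A single run like this only shows the mechanics. Comparing the algorithms would require repeating the simulation many times across different imbalance levels and mean differences. The folds here are drawn at random without stratification, so the minority count can vary from fold to fold.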

Full Text