Comparison of Error Rate Prediction Methods in Classification Modeling with the CHAID Method for Imbalanced Data

Seif Adil El-Muslih, Dodi Vionanda, Nonong Amalita, Admi Salma

doi:10.24036/ujsds/vol1-iss4/81


Open Access

https://doi.org/10.24036/ujsds/vol1-iss4/81


Abstract

CHAID (Chi-Square Automatic Interaction Detection) is a classification algorithm in the decision tree family. Its classification results are displayed as a tree diagram. Once the model is built, its accuracy must be assessed to evaluate its performance, which can be done by estimating the model's prediction error rate. Three methods are considered for this: leave-one-out cross-validation (LOOCV), hold-out, and k-fold cross-validation. These methods differ in how they divide the data into training and testing sets, so each has advantages and disadvantages. Imbalanced data are data in which the classes have different numbers of observations. In the CHAID method, imbalanced data affect the prediction results: as the data become more imbalanced, the prediction result approaches the proportion of the minority class. Therefore, the three error rate prediction methods were compared to determine which is appropriate for the CHAID method on imbalanced data. This is an experimental study that uses simulated data generated in RStudio. The comparison considered several factors: the marginal probability matrix, different correlations, and several observation ratios. The results were examined with boxplots, looking at the median error rate and favoring the lowest variance. The study finds that k-fold cross-validation is the most suitable error rate prediction method for the CHAID method on imbalanced data.
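The three error rate prediction methods differ only in how they split the data, which a minimal sketch can make concrete. The Python code below is an illustration under stated assumptions, not the study's procedure: the study generated its data in RStudio and fitted CHAID trees, whereas this sketch uses a majority-class predictor as a stand-in classifier and a hypothetical 90/10 label vector. All function names are invented for the example.

```python
import random

def majority_class(labels):
    """Stand-in classifier (NOT CHAID): always predict the most frequent training label."""
    return max(set(labels), key=labels.count)

def error_rate(train_labels, test_labels):
    """Fraction of test labels misclassified by the majority-class stand-in."""
    pred = majority_class(train_labels)
    return sum(y != pred for y in test_labels) / len(test_labels)

def hold_out(labels, test_fraction=0.3, seed=0):
    """Hold-out: one random split into training and testing data."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(labels) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    return error_rate([labels[i] for i in train], [labels[i] for i in test])

def k_fold(labels, k=5, seed=0):
    """K-fold: average error over k disjoint test folds; each observation is tested once."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        held = set(fold)
        train = [labels[i] for i in idx if i not in held]
        errors.append(error_rate(train, [labels[i] for i in fold]))
    return sum(errors) / k

def loocv(labels):
    """LOOCV: k-fold with k equal to the number of observations."""
    return k_fold(labels, k=len(labels))

# Hypothetical imbalanced sample: 90 majority (0) and 10 minority (1) observations.
labels = [0] * 90 + [1] * 10
estimates = {
    "hold-out": hold_out(labels),
    "5-fold": k_fold(labels),
    "LOOCV": loocv(labels),
}
print(estimates)
```

With this stand-in, every minority observation is misclassified, so the 5-fold and LOOCV estimates both equal the minority share of 0.1, while the hold-out estimate depends on how many minority observations happen to fall in the single test split. A real comparison would replace `majority_class` with a fitted CHAID tree and repeat the estimates over many simulated data sets.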
