Abstract

Multi-class classification poses challenges beyond those of binary classification, mainly because the interactions between the explanatory variables and the response become increasingly complex. Ensemble-based methods such as boosting and random forest (RF) have proven effective for classification problems. We conducted this research to study multi-class classification using CatBoost, a method based on gradient boosting, and double random forest (DRF), an extension of RF that is well suited when the resulting RF model underfits. The analysis was carried out on simulated and empirical data. In the simulation study, we generated data according to the distance between classes: high, medium, and low. The empirical data are industrial classification codes (KBLI). At a high distance, CatBoost and DRF solve the multi-class classification problem correctly, each reaching a balanced accuracy of 100%. At a medium distance, CatBoost and DRF produce balanced accuracy scores of 99.25% and 97.54%, respectively, and at a low distance 32.37% and 23.97%. In the empirical study, CatBoost outperforms DRF by 4.27%. All of these differences are statistically significant according to the t-test. We also use LIME to explain individual CatBoost predictions and identify the words that contribute most to the prediction of an example class.
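For orientation, the following is a minimal sketch (not the authors' code) of the kind of pipeline the abstract describes: text features feed a CatBoost multi-class model, performance is scored with balanced accuracy, and LIME highlights the words that drive an individual prediction. The toy texts, labels, TF-IDF preprocessing, and hyperparameters below are illustrative assumptions, not the KBLI data or the paper's settings.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from lime.lime_text import LimeTextExplainer

# Toy stand-in for business-activity descriptions and their class labels.
texts = [
    "retail sale of food and beverages", "retail sale of clothing",
    "manufacture of wooden furniture", "manufacture of metal furniture",
    "freight transport by road", "passenger transport by road",
] * 20
labels = ["retail", "retail", "manufacturing", "manufacturing",
          "transport", "transport"] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=0)

# Turn raw text into numeric features and fit a multi-class CatBoost model.
vectorizer = TfidfVectorizer()
model = CatBoostClassifier(loss_function="MultiClass", iterations=200, verbose=False)
model.fit(vectorizer.fit_transform(X_train).toarray(), y_train)

# Balanced accuracy averages per-class recall, as used in the study.
y_pred = model.predict(vectorizer.transform(X_test).toarray()).ravel()
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))

# LIME needs a function that maps raw texts to class probabilities.
def predict_proba(raw_texts):
    return model.predict_proba(vectorizer.transform(raw_texts).toarray())

explainer = LimeTextExplainer(class_names=list(model.classes_))
explanation = explainer.explain_instance(
    X_test[0], predict_proba, num_features=5, top_labels=1)
# Words with the largest contribution to the predicted class of this example.
print(explanation.as_list(label=explanation.available_labels()[0]))

The sketch uses TF-IDF features for simplicity; the paper's own text representation and preprocessing may differ.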

