Abstract

We show the importance of maximum tree depth in classifying publicly available Big Data with ensemble learners. To the best of our knowledge, this is the largest dataset used in any study on the impact of maximum tree depth on the classification of Big Data with multiple ensemble learners. We present classification results for two popular, open-source machine learning algorithms, XGBoost and Random Forest. These learners come from different families of algorithms: XGBoost is a boosting method, whereas Random Forest is a bagging method. We find that increasing maximum tree depth has a profound impact on their performance. Our contribution is to show how important maximum tree depth is when working with imbalanced Big Data with high-cardinality categorical features. For the data used in this study, our results show that one should increase maximum tree depth; otherwise, classification scores may be suboptimal. XGBoost's mean AUC score increases from 0.75633 with its default maximum tree depth of 6 to 0.97273 with a maximum tree depth of 24. Similarly, Random Forest's mean AUC score increases from 0.80821 with its default maximum tree depth of 16 to 0.96996 with maximum tree depth set to 32.
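The depth comparison described above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the paper's experimental pipeline: the synthetic imbalanced dataset, the chosen depths, and the train/test split are all assumptions made for the sketch.

```python
# Minimal sketch (not the paper's pipeline): vary max_depth for a
# Random Forest on a synthetic imbalanced dataset and compare AUC.
# Dataset size, class weights, and depth grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=10,
    weights=[0.95, 0.05], random_state=0)  # imbalanced classes
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=0)

scores = {}
for depth in (4, 8, 16, 32):  # shallow vs. deep trees
    clf = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    # AUC is computed from the positive-class probability, not hard labels
    scores[depth] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

for depth, auc in scores.items():
    print(f"max_depth={depth:2d}  AUC={auc:.4f}")
```

The same sweep applies to XGBoost by swapping in `xgboost.XGBClassifier(max_depth=depth)`; for genuinely imbalanced data, a stratified split and a probability-based metric such as AUC matter more than raw accuracy.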
