Learning from unbalanced catastrophic out-of-pocket health expenditure dataset: blending SMOTE-boosting with ensemble models

Songul Cinaroglu

doi:10.1080/0952813x.2022.2143907

Abstract

This study demonstrates the benefits of synthetic data generation with the Synthetic Minority Oversampling Technique (SMOTE) and combines this procedure with SMOTEBoosting, applying learning algorithms to model an unbalanced catastrophic out-of-pocket (OOP) health expenditure dataset. Nationally representative household budget survey data for 2012 were obtained from the Turkish Statistical Institute. A total of 9987 households responded to the survey. The original dataset was highly unbalanced: only 0.14% of households faced catastrophic health expenses. SMOTE was used to perform balanced oversampling, and 10 artificial datasets were generated, with sizes ranging from 10% to 100% of the majority group in the original training data. To predict catastrophic OOP health expenditures, SMOTEBoosting was embedded with learning algorithms such as C5.0, random forest (RF), naïve Bayes, and support vector machine. The results confirm the outstanding prediction performance of the blended strategy of SMOTEBoosting with RF (area under the curve > 0.85). A variable importance plot and a decision tree show that being at least 65 years of age is the most important predictor of catastrophic cases. The findings highlight that multistrategy ensemble learning techniques are useful for modelling highly unbalanced datasets.
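The oversampling step described in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: it covers only the SMOTE interpolation step, not SMOTEBoosting or any of the classifiers, and it assumes that "sizes from 10% to 100% of the majority group" means the minority class is grown until it reaches that fraction of the majority count. The function names `smote` and `oversampled_sets`, the neighbour count `k=5`, and the toy two-feature data are all hypothetical choices made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def oversampled_sets(majority, minority, steps=10, k=5, seed=0):
    """Yield (ratio, minority plus synthetic points) for ratios 1/steps .. 1.0."""
    for i in range(1, steps + 1):
        ratio = i / steps
        target = round(ratio * len(majority))  # assumed target minority size
        n_new = max(0, target - len(minority))
        yield ratio, list(minority) + smote(minority, n_new, k, seed + i)


# Toy data (hypothetical): 200 majority and 6 minority points with 2 features.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(6)]

datasets = list(oversampled_sets(majority, minority))
for ratio, grown in datasets:
    print(f"ratio {ratio:.1f}: minority size {len(grown)}")
```

Each synthetic point lies on a line segment between two real minority observations, so the generated cases stay inside the region already occupied by the minority class. In a full pipeline, each of the ten enlarged minority sets would be combined with the majority data and passed to a classifier.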
