Abstract
This study demonstrates the benefits of synthetic data generation with the Synthetic Minority Oversampling Technique (SMOTE) and combines this procedure with SMOTEBoosting, applying learning algorithms to model an unbalanced dataset of catastrophic out-of-pocket (OOP) health expenditure. Nationally representative household budget survey data for 2012 were obtained from the Turkish Statistical Institute. A total of 9987 households responded to the survey. The original dataset was highly unbalanced: only 0.14% of households faced catastrophic health expenses. SMOTE was used to perform balanced oversampling, generating 10 artificial datasets with sizes ranging from 10% to 100% of the majority group in the original training data. To predict catastrophic OOP health expenditures, SMOTEBoosting was combined with learning algorithms including C5.0, random forest (RF), naïve Bayes, and support vector machine. The results confirm the strong prediction performance of the blended strategy of SMOTEBoosting with RF (area under the curve > 0.85). A variable importance plot and a decision tree show that being at least 65 years of age is the most important predictor of catastrophic cases. The findings highlight that multistrategy ensemble learning techniques are useful for modelling highly unbalanced datasets.
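The oversampling step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`smote`, `oversample_to_ratio`), the choice of `k`, the toy data, and the reading of "10% to 100% of the majority group" as the target size of the minority class are all assumptions made for illustration. The sketch shows only the core SMOTE idea, which is to create each synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (the basic SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def oversample_to_ratio(minority, n_majority, ratio, k=5, seed=0):
    """Assumed interpretation: grow the minority class until its size is
    about ratio * n_majority, keeping the original minority samples."""
    target = round(ratio * n_majority)
    n_new = max(0, target - len(minority))
    return list(minority) + smote(minority, n_new, k=k, seed=seed)


# Hypothetical toy data: four minority points and 100 majority samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Ten oversampled minority sets, from 10% to 100% of the majority size.
datasets = {
    step * 10: oversample_to_ratio(minority, n_majority=100, ratio=step / 10, k=3)
    for step in range(1, 11)
}
```

In practice, a maintained implementation such as the one in the imbalanced-learn package would normally be preferred over a hand-written version; the sketch is intended only to make the interpolation step concrete.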