Anemia in children aged 5-12 years remains a public health issue in Indonesia. Early detection and control of risk factors are crucial for prevention, and machine learning models, particularly ensemble learning models, offer a practical way to support both. Health data, however, commonly exhibit class imbalance. This study therefore performs classification modeling using two ensemble learning models, Random Forest (RF) and CatBoost, combined with six methods for handling class imbalance: Random Over Sampling, SMOTE, G-SMOTE, Random Under Sampling, Instance Hardness Threshold (IHT), and SMOTE-ENN. In addition, SHAP is used to explain the best-performing model based on Shapley values. The findings indicate that CatBoost combined with G-SMOTE performs best among the methods compared. Averaged over 100 validation replicates, the CatBoost G-SMOTE model achieves a sensitivity of 0.7104, specificity of 0.7043, G-Mean of 0.7067, and AUC of 0.7844. Handling class imbalance with G-SMOTE effectively increases sensitivity in both ensemble learning models, while SMOTE-ENN yields effective G-Mean values for the Random Forest model. Based on Shapley values, the features contributing most to the prediction of anemia in children aged 5-12 years are ferritin, vitamin A, vegetable consumption, a pneumonia diagnosis, zinc, total calcium, and consumption of soft or carbonated drinks.
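As an illustration of the reported metrics, the following is a minimal sketch of how sensitivity, specificity, and G-Mean can be computed from binary predictions. It uses the standard definitions (G-Mean as the geometric mean of sensitivity and specificity); the function name `sensitivity_specificity_gmean` and the toy labels are hypothetical and are not taken from the study.

```python
from math import sqrt


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, G-Mean) for binary labels.

    Hypothetical helper for illustration only; not the study's code.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive case correctly flagged
            else:
                fn += 1  # positive case missed
        else:
            if pred == positive:
                fp += 1  # negative case wrongly flagged
            else:
                tn += 1  # negative case correctly cleared
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    g_mean = sqrt(sensitivity * specificity)
    return sensitivity, specificity, g_mean


# Toy data (made up): 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec, gmean = sensitivity_specificity_gmean(y_true, y_pred)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} G-Mean={gmean:.4f}")
```

If G-Mean is computed within each replicate and then averaged, which the abstract's wording suggests but does not state outright, the mean G-Mean need not equal the square root of the product of the mean sensitivity and mean specificity. That would be consistent with the reported 0.7067 lying slightly below sqrt(0.7104 × 0.7043) ≈ 0.7073.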