This article investigates the application of three tree-ensemble machine learning models, Random Forest, XGBoost, and LightGBM, to predicting human capital readiness in the low-income Kut Bak district. Human capital readiness, which encompasses skills, knowledge, and health, is crucial for socio-economic development but remains under-researched in impoverished regions. The dataset comprises 953 individuals from 302 households, with variables including age, gender, health, employment status, and various socio-economic factors. Data preparation consisted of cleaning, encoding, and normalization, followed by feature selection and a split into training (80%) and testing (20%) sets. The models were evaluated with 10-fold cross-validation so that the reported performance does not depend on a single partition of the data. The results indicate that XGBoost outperforms the other models, achieving a mean cross-validation accuracy of 0.991, precision of 0.993, recall of 0.992, F1-score of 0.992, and an AUC-ROC score of 0.995. Random Forest follows closely with a mean accuracy of 0.987, precision of 0.990, recall of 0.988, F1-score of 0.989, and an AUC-ROC score of 0.993. LightGBM still performs well but shows slightly lower metrics, with a mean accuracy of 0.975, precision of 0.978, recall of 0.976, F1-score of 0.977, and an AUC-ROC score of 0.980. The confusion matrix analysis shows that XGBoost has the highest number of correct classifications (189 of 191 cases), with 64 true negatives, 0 false positives, 2 false negatives, and 125 true positives. The feature importance analysis highlights ‘Family Responsibilities’, ‘Elderly’, and the ‘Crop Cultivation Specialist’ occupation as significant predictors across models, whereas LightGBM places greater weight on ‘Age’ and ‘Gender’. This research underscores the utility of machine learning in socio-economic planning, offering actionable insights for policymakers aiming to enhance human capital readiness in economically disadvantaged areas.
Future research should focus on expanding the dataset, tuning model hyperparameters, and exploring additional socio-demographic variables to improve predictive accuracy.
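
As a reading aid, the XGBoost confusion matrix quoted in the abstract can be turned back into the standard classification metrics. The sketch below is illustrative only: the function name `metrics_from_confusion` is hypothetical, and it assumes the four counts come from a single evaluation (they sum to 191, roughly 20% of the 953 individuals, which would be consistent with the held-out test split, though the abstract does not say so explicitly).

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from binary confusion counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against empty denominators so the helper is safe on degenerate input.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts reported in the abstract for XGBoost.
xgb = metrics_from_confusion(tn=64, fp=0, fn=2, tp=125)
for name, value in xgb.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption, the counts give an accuracy of about 0.990 (189/191), a precision of 1.000, a recall of about 0.984, and an F1-score of about 0.992. These differ slightly from the cross-validation means reported above, which is expected because those are averages over ten folds rather than values from one confusion matrix.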