Reading is a fundamental skill that underpins learning across disciplines, and reading literacy is a key indicator of the quality of a nation's education system. The Programme for International Student Assessment (PISA) provides an international evaluation of students' reading literacy across many countries, including Indonesia. Numerous studies have used machine learning algorithms to predict reading literacy, but achieving high model accuracy remains a challenge. This study compares two widely used algorithms, Gradient Boosting Decision Trees (GBDT) and Extreme Gradient Boosting (XGBoost), for predicting reading literacy, using PISA 2022 data from 12,853 Indonesian students and 59 variables from the Student Questionnaire Data File. GridSearchCV was used to select the hyperparameters of each model, with the aim of maximizing performance on the test data. GBDT achieved an R² of 0.5106 with the optimal parameters n_estimators = 150, learning_rate = 0.2, max_depth = 3, and subsample = 0.9. XGBoost reached a higher R² of 0.5247 with the parameters n_estimators = 1000, learning_rate = 0.01, max_depth = 7, colsample_bytree = 0.3, min_child_weight = 20, gamma = 1, and alpha = 0, an improvement of about 0.014 that indicates XGBoost predicted reading literacy better than GBDT. Further analysis showed that the most important variables in the GBDT model were students' access to technology at home, extracurricular creative activities, socioeconomic status, school involvement in sustainable development, and problem-solving skills. In the XGBoost model, the most important variables were family support, socioeconomic status, school belongingness, the family environment's effectiveness in fostering creativity, and student imagination. These findings can help education policymakers understand the factors that contribute to students' reading literacy and design more targeted and effective interventions.
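The tuning procedure described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn's GradientBoostingRegressor as the GBDT implementation, replaces the PISA data with a small synthetic regression set, and uses a hypothetical parameter grid that merely contains the GBDT optimum reported above. The split ratio, fold count, and random seeds are also assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the questionnaire features (assumption: NOT the PISA data).
X, y = make_regression(
    n_samples=400, n_features=20, n_informative=8, noise=25.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; it only needs to include the reported GBDT optimum
# (n_estimators=150, learning_rate=0.2, max_depth=3, subsample=0.9).
param_grid = {
    "n_estimators": [50, 150],
    "learning_rate": [0.05, 0.2],
    "max_depth": [3],
    "subsample": [0.9],
}

# Exhaustive search over the grid, scored by cross-validated R² on the training split.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test split.
best_params = search.best_params_
test_r2 = r2_score(y_test, search.best_estimator_.predict(X_test))
print(best_params, round(test_r2, 4))
```

The same pattern should carry over to XGBoost, since its XGBRegressor class follows the scikit-learn estimator interface and can be passed to GridSearchCV with a grid over the parameters listed in the abstract. Scores from this synthetic example say nothing about the R² values reported in the study.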