Predictive Models of Student Performance for Data-Driven Learning Analytics

Sean M Shiverick

doi:10.5281/zenodo.3247036

Abstract

Analytic tools are useful for detecting patterns in education data and providing insights about student performance and learning. This study compared six supervised learning algorithms (linear regression, ridge regression, the lasso, regression trees, random forest regression, and gradient boosted regression) and identified features important for predicting student performance. The dataset consisted of N = 1044 observations from two secondary schools in Portugal (UCI-MLR, Cortez & Silva, 2008). Performance was assessed by final grades (range: 0-20) in two courses, mathematics and Portuguese. The models were fit to training data with 27 independent variables and evaluated on a testing subset. Overall, student performance was lower in mathematics than in Portuguese. The models selected a similar set of variables as important for predicting performance: mother's education level, student plans for higher education, and weekly study time were positively related to predicted performance, whereas course subject, school educational support, and romantic relationships were associated with decreased student performance. The models differed in the number, weighting, order, and importance given to predictor variables. Linear regression provided a model with 13 predictors. Ridge regression shrank the coefficient estimates toward zero; the lasso performed variable selection for a model with 20 predictors. There was a tradeoff between model complexity and interpretability. The single pruned regression tree provided a simple, interpretable non-linear model with four features. Random forest regression and gradient boosting reduced overfitting but were more difficult to interpret. Advantages and limitations of the different models are discussed. Applications for educational data mining (EDM) and learning analytics (LA) are considered.

Full Text