Abstract

In this paper, we investigate how to predict student dropout using machine-learning approaches. We discuss three methods: logistic regression, random forest, and gradient boosting (GB), applied to the KDD Cup 2015 dataset. Since the dataset is imbalanced, we applied the synthetic minority over-sampling technique (SMOTE). Six models are obtained; for each, grid-search cross-validation is used to find the hyperparameters that optimize the mapping function with respect to accuracy and the area under the receiver operating characteristic curve (AUC ROC). To validate our scores, we used 10-fold cross-validation. Our findings show that ensemble machine-learning methods produce better results than basic machine-learning methods, and that the extracted features and the applied techniques can predict dropout with high efficiency. The best models were then combined with an autoregressive integrated moving average (ARIMA) model for time-series data to detect students at high risk of dropping out in the earlier weeks of a course. The AUC ROC obtained on the KDD Cup dataset with GB reaches 0.89, within 0.1 of the KDD Cup winner's score.
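The pipeline described above can be sketched in scikit-learn. This is a minimal illustration, not the authors' implementation: it uses synthetic toy data in place of the KDD Cup 2015 features, a hand-rolled simplification of SMOTE (the original algorithm interpolates between a minority sample and one of its k nearest minority neighbors), and an illustrative hyperparameter grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE: synthesize points on segments joining minority
    samples to their k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # skip index 0 (the point itself)
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Imbalanced toy data standing in for the dropout-prediction features
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
n_new = (y == 0).sum() - (y == 1).sum()          # samples needed to balance
X_bal = np.vstack([X, smote(X[y == 1], n_new)])
y_bal = np.concatenate([y, np.ones(n_new, dtype=int)])

# Grid-search GB hyperparameters, scored by AUC ROC with 10-fold CV
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="roc_auc", cv=10,
).fit(X_bal, y_bal)
print(grid.best_params_, round(grid.best_score_, 2))
```

The same grid-search pattern applies to the logistic-regression and random-forest models, with their own parameter grids; only the estimator and grid change.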
