On stability and robustness of students' dropout prediction

ABSTRACT High dropout rates are a serious problem at universities all over the world. To study the factors behind student dropout, we analyzed data on students who enrolled in bachelor's studies at one Czech university in five consecutive academic years, 2013/14–2017/18. Using decision trees and logistic regression, we created several classification models to predict who will finish their studies successfully and who will not. Analyzing data collected on a semester basis, we found that the most important variable in all five years is the percentage of lost credits in the most recent semester. We also assessed the stability and robustness of the classification models by testing them on data about students who enrolled in subsequent years. Here, we found that decision trees are more stable and robust than logistic regression models.
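
The evaluation idea in the abstract (fit a model on one enrolment cohort, then test it on later cohorts) can be illustrated with a minimal sketch. This is not the authors' method: it swaps their decision trees and logistic regression for a single-threshold rule (a one-split "stump") on the percentage of lost credits, and the cohort data, the function names, and the definition of lost credits as enrolled minus earned are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (percentage of lost credits in the most recent semester, dropped out?)
Student = Tuple[float, bool]

def lost_credit_pct(enrolled: int, earned: int) -> float:
    """Assumed definition: share of enrolled credits that were not earned."""
    if enrolled <= 0:
        return 0.0
    return 100.0 * (enrolled - earned) / enrolled

def accuracy(threshold: float, cohort: List[Student]) -> float:
    """Share of students classified correctly by 'dropout if pct > threshold'."""
    hits = sum((pct > threshold) == dropped for pct, dropped in cohort)
    return hits / len(cohort)

def fit_stump(cohort: List[Student]) -> float:
    """Pick the cut point with the best training accuracy (a one-split tree)."""
    values = sorted({pct for pct, _ in cohort})
    # Candidate cut points are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    return max(candidates, key=lambda t: accuracy(t, cohort))

def stability_report(cohorts: Dict[str, List[Student]]) -> Dict[Tuple[str, str], float]:
    """Train on each cohort and score the fitted rule on every later cohort."""
    years = sorted(cohorts)
    report = {}
    for i, train_year in enumerate(years):
        threshold = fit_stump(cohorts[train_year])
        for test_year in years[i + 1:]:
            report[(train_year, test_year)] = accuracy(threshold, cohorts[test_year])
    return report

# Invented toy cohorts, for illustration only (not data from the study).
cohorts = {
    "2013/14": [
        (lost_credit_pct(30, 30), False),
        (lost_credit_pct(30, 27), False),
        (lost_credit_pct(30, 12), True),
        (lost_credit_pct(30, 0), True),
    ],
    "2014/15": [
        (lost_credit_pct(30, 24), False),
        (lost_credit_pct(30, 9), True),
    ],
}

if __name__ == "__main__":
    for (train_year, test_year), acc in stability_report(cohorts).items():
        print(f"trained on {train_year}, tested on {test_year}: accuracy {acc:.2f}")
```

Under this framing, a model would count as stable if its accuracy on later cohorts stays close to its accuracy on the cohort it was trained on; the same loop could wrap any classifier in place of the stump.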

Similar Papers
  • Research Article
  • Citations: 73
  • 10.1109/access.2020.3045157
MOOC Dropout Prediction Using FWTS-CNN Model Based on Fused Feature Weighting and Time Series
  • Jan 1, 2020
  • IEEE Access
  • Yafeng Zheng + 3 more

High dropout rates have been a major problem affecting the development of Massive Open Online Courses (MOOCs). Student dropout prediction can help teachers identify students who are tending to fail and provide extra help in a timely manner, helping to improve the effectiveness of online learning. In recent years, the use of convolutional neural networks for dropout prediction has yielded good results. However, traditional convolutional neural networks use automatic feature extraction, which does not consider the importance of the learner's behavior features and the effect of the time series of behavior on dropout, so it is difficult to guarantee the final prediction effect. To solve this problem, this article proposes a convolutional neural network model FWTS-CNN that integrates feature weighting and behavioral time series. It extracts continuous behavioral features from the learner's log of learning activities, filters key features and ranks them by importance based on the decision tree, then weights the continuous behavioral features based on importance, and finally builds a convolutional neural network model based on behavioral time series and weighted features. Experiments on the KDD Cup 2015 dataset show that the FWTS-CNN dropout prediction model has a high accuracy, which can reach more than 87%, an improvement of about 2% over using the CNN algorithm alone. The FWTS-CNN model integrates the effects of behavioral features and behavior time on dropout, effectively improving the accuracy of dropout prediction.

  • Research Article
  • Citations: 82
  • 10.1155/2019/8404653
MOOC Dropout Prediction Using a Hybrid Algorithm Based on Decision Tree and Extreme Learning Machine
  • Jan 1, 2019
  • Mathematical Problems in Engineering
  • Jing Chen + 5 more

Massive Open Online Courses (MOOCs) have boomed in recent years because learners can arrange learning at their own pace. High dropout rate is a universal but unsolved problem in MOOCs. Dropout prediction has received much attention recently. A previous study reported the problem of learning behavior discrepancy leading to a wide range of fluctuation of prediction results. Besides, previous methods require iterative training which is time intensive. To address these problems, we propose DT‐ELM, a novel hybrid algorithm combining decision tree and extreme learning machine (ELM), which requires no iterative training. The decision tree selects features with good classification ability. Further, it determines enhanced weights of the selected features to strengthen their classification ability. To achieve accurate prediction results, we optimize ELM structure by mapping the decision tree to ELM based on the entropy theory. Experimental results on the benchmark KDD 2015 dataset demonstrate the effectiveness of DT‐ELM, which is 12.78%, 22.19%, and 6.87% higher than baseline algorithms in terms of accuracy, AUC, and F1‐score, respectively.

  • Research Article
  • Citations: 9
  • 10.55463/issn.1674-2974.49.4.17
Building Multiclass Classification Model of Logistic Regression and Decision Tree Using the Chi-Square Test for Variable Selection Method
  • Apr 30, 2022
  • Journal of Hunan University Natural Sciences
  • Waego H Nugroho + 3 more

The growth and development of children under five (toddlers) affect their health conditions. Each region uniquely identifies the main factors influencing the toddler's health condition. The status of toddlers is generally categorized into two classes, namely normal and abnormal, so it is often found that the condition of toddler status is in the form of multi-response variables. Combining the two binary classes' response variables will form a multiclass response variable requiring different model development techniques and performance measurements. This study aims to determine the main factors that affect toddlers' health conditions in Malang, Indonesia, build multiclass logistic regression and decision tree classification models, and measure the model's performance. The Chi-square test selected predictor features as the input of multiclass logistic regression and decision tree models. From the feature selection, four main factors influence the status of toddlers' health conditions in Malang: the mother's history of diabetes before pregnancy, the father's blood pressure, psychological condition, and drinking water quality. The decision tree model performs better than the logistic regression model on the various performance measures used.

  • Conference Article
  • Citations: 94
  • 10.1109/iccse.2016.7581554
Machine learning application in MOOCs: Dropout prediction
  • Aug 1, 2016
  • Jiajun Liang + 2 more

Massive Open Online Courses (MOOCs) have undergone explosive growth recently: both the number of MOOC platforms and the number of courses have increased dramatically. One of the major concerns in MOOCs is the high dropout rate. We study dropout prediction in MOOCs, using students' learning activity data over a period of time to estimate how likely students are to drop out in the next couple of days. We collect data on 39 courses from the XuetangX platform, which is based on the open-source edX platform. Using a supervised classification approach, we achieve 89% accuracy in the dropout prediction task with a gradient boosting decision tree model. We describe the dropout prediction framework in detail, including data extraction from the edX platform, data preprocessing, feature engineering, and performance tests on several supervised classification models.

  • Research Article
  • 10.55041/ijsrem44873
Dropout Prediction with Supervised Learning
  • Apr 17, 2025
  • INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Shivani Awasthi

Student dropout is a critical issue in the education sector, impacting institutional efficiency and student success. This project, Dropout Prediction with Supervised Learning, leverages machine learning models—Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbours (KNN), and Naïve Bayes (NB)—to predict student dropouts based on historical academic, demographic, and behavioural data. The study involves data preprocessing, feature selection, and model evaluation to identify key factors influencing dropout rates. Supervised learning techniques are employed to classify students into "at-risk" and "not at-risk" categories. The performance of each model is assessed using accuracy, precision, recall, and F1-score metrics to determine the most effective predictor. The findings aim to provide educational institutions with actionable insights, enabling early intervention strategies such as academic counselling and financial aid support. By implementing predictive analytics, institutions can enhance student retention and improve overall educational outcomes. Keywords – Dropout Prediction, Supervised Learning, Machine Learning Models, Student Retention, Predictive Analytics, Classification Algorithms

  • Research Article
  • 10.1186/s12888-025-07261-w
Unveiling postpartum PTSD: predicting risk factors using decision trees and logistic regression in Chinese women
  • Aug 19, 2025
  • BMC Psychiatry
  • Xiao Fei Nie + 6 more

Background: While traditional logistic regression emphasizes main effects with limited capacity for interaction detection, emerging decision trees excel in uncovering complex associations. However, no studies have yet integrated both approaches to investigate postpartum posttraumatic stress disorder (PP-PTSD). This study aims to explore the factors associated with postpartum posttraumatic stress disorder (PP-PTSD) in Chinese women using decision tree and logistic regression models, while also comparing the predictive performance of both approaches. Methods: This cross-sectional study recruited postpartum women using convenience sampling between June 2021 and December 2022. PTSD was assessed using the City Birth Trauma Scale (City BiTS). The Perceived Social Support Scale (PSSS), Simplified Coping Style Questionnaire (SCSQ), Pregnancy Stress Rating Scale (PSRS), and Connor-Davidson Resilience Scale (CD-RISC) were employed to evaluate perceived social support, psychological coping strategies, pregnancy stress and resilience, respectively. Decision tree and logistic regression models were applied to identify factors associated with PTSD. Results: Among 704 valid participants, 36 (5.11%) screened positive for PP-PTSD. Logistic regression identified postpartum duration, sleep quality, pregnancy stress, family support, and positive coping as significant predictors of PP-PTSD (p < 0.05). The decision tree model highlighted postpartum sleep quality as the primary determinant, followed by pregnancy stress and postpartum duration. While both models achieved perfect sensitivity (100%), logistic regression demonstrated superior overall performance, with a 2.28% higher classification accuracy (97.73% vs. 95.45%) and enhanced specificity (97.9% vs. 88.9%). The AUC values further validated this advantage (0.992 vs. 0.968). Conclusions: This study utilized Logistic Regression and Decision Tree models to identify key factors influencing PP-PTSD, which include postpartum duration, sleep quality, pregnancy stress, family support, and positive coping. The identified modifiable factors enable targeted PP-PTSD prevention, with Logistic Regression providing high-accuracy screening tools and Decision Trees simplifying risk assessment in community settings.

  • Research Article
  • Citations: 22
  • 10.1186/s13690-021-00782-2
Vaccination dropout rates among children aged 12-23 months in Democratic Republic of the Congo: a cross-sectional study
  • Jan 5, 2022
  • Archives of Public Health
  • Harry-César Kayembe-Ntumba + 6 more

Background: Overall, 1.8 million children fail to receive the 3-dose series for diphtheria, tetanus and pertussis each year in the Democratic Republic of the Congo (DRC). Currently, an emergency plan targeting 9 provinces including Kinshasa, the capital of the DRC, is launched to reinforce routine immunization. Mont Ngafula II was the only health district that experienced high vaccination dropout rates for nearly five consecutive years. This study aimed to identify factors predicting high immunization dropout rates among children aged 12-23 months in the Mont Ngafula II health district. Methods: A cross-sectional household survey was conducted among 418 children in June-July 2019 using a two-stage sampling design. Socio-demographic and perception data were collected through a structured interviewer-administered questionnaire. The distribution of 2017-2018 immunization coverage and dropout rate was extracted from the local health district authority and mapped. Logistic random effects regression models were used to identify predictors of high vaccination dropout rates. Results: Of the 14 health areas in the Mont Ngafula II health district, four reported high vaccine coverage, only one recorded low vaccine coverage, and three reported both low vaccine coverage and high dropout rate. In the final multivariate logistic random effects regression model, the predictors of immunization dropout among children aged 12-23 months were: living in rural areas, unavailability of seats, non-compliance with the order of arrival during vaccination in health facilities, and lack of a reminder system on days before the scheduled vaccination. Conclusions: Our results advocate for prioritizing targeted interventions and programs to strengthen interpersonal communication between immunization service providers and users during vaccination in health facilities and to implement an SMS reminder system on days before the scheduled vaccination.

  • Research Article
  • Citations: 4
  • 10.3389/fonc.2022.934108
Evaluation of the Efficiency of MRI-Based Radiomics Classifiers in the Diagnosis of Prostate Lesions.
  • Jul 5, 2022
  • Frontiers in Oncology
  • Linghao Li + 8 more

Objective: To compare the performance of different imaging classifiers in the prospective diagnosis of prostate diseases based on multiparameter MRI. Methods: A total of 238 patients with pathological outcomes were enrolled from September 2019 to July 2021, including 142 in the training set and 96 in the test set. After the regions of interest were manually segmented, decision tree (DT), Gaussian naive Bayes (GNB), XGBoost, logistic regression, random forest (RF) and support vector machine classifier (SVC) models were established on the training set and tested on the independent test set. The prospective diagnostic performance of each classifier was compared by using the AUC, F1-score and Brier score. Results: In the patient-based data set, the top three classifiers of combined sequences in terms of the AUC were logistic regression (0.865), RF (0.862), and DT (0.852); RF was significantly different from the other two classifiers (P = 0.022, P = 0.005), while the difference between logistic regression and DT was not statistically significant (P = 0.802). In the lesions-based data set, the top three classifiers of combined sequences in terms of the AUC were RF (0.931), logistic regression (0.922) and GNB (0.922). These three classifiers were significantly different from. Conclusion: The results of this experiment show that radiomics has a high diagnostic efficiency for prostate lesions. The RF classifier generally performed better overall than the other classifiers in the experiment. The XGBoost and logistic regression models also had high classification value in the lesions-based data set.

  • Research Article
  • Citations: 14
  • 10.1093/gastro/goac053
Development and validation of novel models for the prediction of intravenous corticosteroid resistance in acute severe ulcerative colitis using logistic regression and machine learning.
  • Jan 25, 2022
  • Gastroenterology Report
  • Si Yu + 8 more

Background: The early prediction of intravenous corticosteroid (IVCS) resistance in acute severe ulcerative colitis (ASUC) patients remains an unresolved challenge. This study aims to construct and validate a model that accurately predicts IVCS resistance. Methods: A retrospective cohort was established, with consecutive inclusion of patients who met the diagnosis criteria of ASUC and received IVCS during index hospitalization in Peking Union Medical College Hospital between March 2012 and January 2020. The primary outcome was IVCS resistance. Classification models, including logistic regression and machine learning-based models, were constructed. External validation was conducted in an independent cohort from Shengjing Hospital of China Medical University. Results: A total of 129 patients were included in the derivation cohort. During index hospitalization, 102 (79.1%) patients responded to IVCS and 27 (20.9%) failed; 18 (14.0%) patients underwent colectomy in 3 months; 6 received cyclosporin as rescue therapy, and 2 eventually escalated to colectomy; 5 succeeded with infliximab as rescue therapy. The Ulcerative Colitis Endoscopic Index of Severity (UCEIS) and C-reactive protein (CRP) level at Day 3 are independent predictors of IVCS resistance. The areas under the receiver-operating characteristic curves (AUROCs) of the logistic regression, decision tree, random forest, and extreme-gradient boosting models were 0.873 (95% confidence interval [CI], 0.704–1.000), 0.648 (95% CI, 0.463–0.833), 0.650 (95% CI, 0.441–0.859), and 0.604 (95% CI, 0.416–0.792), respectively. The logistic regression model achieved the highest AUROC value of 0.703 (95% CI, 0.473–0.934) in the external validation. Conclusions: In patients with ASUC, UCEIS and CRP levels at Day 3 of IVCS treatment appeared to allow the prompt prediction of likely IVCS resistance. We found no evidence of better performance of machine learning-based models in IVCS resistance prediction in ASUC. A nomogram based on the logistic regression model might aid in the management of ASUC patients.

  • Research Article
  • 10.5935/jetia.v11i55.1910
Multi-Stage Feature Selection for Optimizing Student Dropout Prediction
  • Jan 1, 2025
  • ITEGAM- Journal of Engineering and Technology for Industrial Applications (ITEGAM-JETIA)
  • Arif Mudi Priyatno + 5 more

The high rate of college dropouts is a significant challenge in higher education. Dropout prediction requires an accurate model supported by a selection of relevant features. This study proposes a step-by-step feature selection framework to improve prediction accuracy, consisting of three stages, namely Variance Threshold, Mutual Information, and Boruta. The classification model is built using the Extreme Gradient Boosting (XGBoost) algorithm, with evaluation through Stratified 10-fold Cross-Validation. The dataset comprises 4,423 student records reflecting academic, demographic, and socioeconomic information. A total of 18 features were confirmed to be relevant by Boruta. XGBoost models trained on the selected features show high performance, with an accuracy of 90.77%, precision of 92.07%, recall of 83.68%, and an F1-score of 87.63%. These results show that integrating filter and wrapper approaches in the feature selection process effectively improves the performance of the dropout prediction model. The framework filters out the important features and produces a more stable and efficient classification model in the context of higher education.

  • Book Chapter
  • Citations: 2
  • 10.1007/978-3-031-20102-8_40
MOOC Dropout Prediction Based on Bayesian Network
  • Jan 1, 2023
  • Shuang Shi + 4 more

High dropout rates and unsatisfactory learning outcomes have become the main problems of MOOC platforms, and early intervention based on dropout prediction is an effective way to address them. To this end, we propose a dropout prediction model based on Bayesian networks (Dropout Prediction Bayesian Network, DPBN), which uses mutual information and pruning to construct the structure of DPBN; the parameters are then learned by maximum likelihood estimation (MLE). The model can represent the influence of each feature on the dropout rate and enhance the interpretability of the model. Based on the constructed DPBN, we adopt the exact inference method to predict dropouts. The experimental results demonstrate the accuracy and validity of our proposed method.

  • Research Article
  • Citations: 9
  • 10.33003/fjs-2023-0706-2103
STUDENT DROPOUT PREDICTION USING MACHINE LEARNING
  • Dec 31, 2023
  • FUDMA JOURNAL OF SCIENCES
  • Eric E Osemwegie + 2 more

In a higher education environment, we considered the likelihood of probable dropouts from a first-year undergraduate Computer Science program. To this end, data from five academic sessions were obtained from the Department of Computer Science, University of Benin, Nigeria. Of the nine hundred and forty-seven (947) records obtained, only nine hundred and six (906) were usable after cleaning and preprocessing. Six distinct classifiers, namely Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (KNN), and Artificial Neural Networks (ANN), were modeled for the prediction of student success and dropouts. The six models performed on average at 90.4%, 98.9%, 98.5%, 97.4%, 96.0% and 97.3%, respectively. Although there wasn't much of a performance difference between the DT, SVM, and LR, the LR model was chosen for deployment since it performs better than the other two models in terms of F1-score and recall.

  • Research Article
  • Citations: 9
  • 10.1186/s12889-022-12617-y
Construction of Xinjiang metabolic syndrome risk prediction model based on interpretable models
  • Feb 8, 2022
  • BMC Public Health
  • Yan Zhang + 9 more

Background: We aimed to construct simple and practical metabolic syndrome (MetS) risk prediction models based on the data of inhabitants of Urumqi and to provide a methodological reference for the prevention and control of MetS. Methods: This is a cross-sectional study conducted in the Xinjiang Uygur Autonomous Region of China. We collected data from inhabitants of Urumqi from 2018 to 2019, including demographic characteristics, anthropometric indicators, living habits and family history. Resampling technology was used to preprocess the data imbalance problems, and then MetS risk prediction models were constructed based on logistic regression (LR) and decision tree (DT). In addition, nomograms and tree diagrams of DT were used to explain and visualize the model. Results: Of the 25,542 participants included in the study, 3,267 (12.8%) were diagnosed with MetS, and 22,275 (87.2%) were diagnosed with non-MetS. Both the LR and DT models based on the random undersampling dataset had good AUROC values (0.846 and 0.913, respectively). The accuracy, sensitivity, specificity, and AUROC values of the DT model were higher than those of the LR model. Based on a random undersampling dataset, the LR model showed that exercises such as walking (OR=0.769) and running (OR=0.736) were protective factors against MetS. Age 60–74 years (OR=1.388), previous diabetes (OR=8.902), previous hypertension (OR=2.830), fatty liver (OR=3.306), smoking (OR=1.541), high systolic blood pressure (OR=1.044), and high diastolic blood pressure (OR=1.072) were risk factors for MetS; the DT model had 7 depth layers and 18 leaves, with BMI as the root node of the DT being the most important factor affecting MetS, and the other variables in descending order of importance: SBP, previous diabetes, previous hypertension, DBP, fatty liver, smoking, and exercise. Conclusions: Both DT and LR MetS risk prediction models have good prediction performance and their respective characteristics. Combining these two methods to construct an interpretable risk prediction model of MetS can provide methodological references for the prevention and control of MetS.

  • Research Article
  • Citations: 58
  • 10.1109/access.2018.2881275
An Integrated Framework With Feature Selection for Dropout Prediction in Massive Open Online Courses
  • Jan 1, 2018
  • IEEE Access
  • Lin Qiu + 2 more

Massive open online courses (MOOCs) have flourished in recent years, which is conducive to the redistribution of high-quality educational resources globally. However, the high dropout rate in the course of operation has seriously affected its development. Therefore, in order to improve the degree of completion, it is an effective way to study how to effectively predict the dropout in MOOCs and intervene in advance. Traditional methods rely on manually extracted features, which is difficult to guarantee the final prediction effect. In order to solve this problem, this paper proposes an integrated framework with feature selection (FSPred) to predict the dropout in MOOCs, which includes feature generation, feature selection, and dropout prediction. Specifically, FSPred applies a fine-grained feature generation method in days to generate features and then uses an ensemble feature selection method to select valid features and feed them into a logistic regression model for prediction. Extensive experiments on a public data set have shown that FSPred can achieve the comparable results with other dropout prediction methods in terms of precision, recall, F1 score, and AUC score. Finally, through the analysis of the features of the final selection, the suggestions for the construction of the MOOCs are put forward.

  • Research Article
  • Citations: 28
  • 10.1016/j.compeleceng.2022.108409
MOOC dropout prediction using a fusion deep model based on behaviour features
  • Oct 15, 2022
  • Computers and Electrical Engineering
  • Yafeng Zheng + 4 more

