This study addresses the challenge of student dropout in the Faculty of Sciences and Technologies of the National University of Caaguazú in Paraguay by constructing an early warning model based on academic factors. Following a data science methodology, academic records were characterized and analyzed using techniques such as cluster analysis and the elbow method to optimize student segmentation. Several predictive machine learning models were fitted, including logistic regression, decision trees, and k-nearest neighbors, and evaluated with precision, recall, and F1-score metrics to determine their effectiveness in classifying academic statuses. The cluster analysis identified four well-defined groups, characterized as early dropout, late dropout, thesis stage, and graduate. The models achieved an average accuracy of 88% and were trained exclusively on academic data (grades obtained in the courses). The data cover four degree programs from 2012 to 2021: Computer Engineering, Civil Engineering, Electronics Engineering, and Electrical Engineering.

An early warning model for student dropout in the Faculty of Sciences and Technologies was built using estimates based on relevant academic factors extracted from the faculty's academic database. The study achieved an effective characterization of this database using data science techniques. First, the elbow method was used to determine the optimal number of clusters, identifying four distinct groups, and the student population was segmented into: graduates, early dropouts (students who dropped out within the first five years of their degree), late dropouts (those who left their degree after five years), and students who completed the curriculum but have yet to present their Final Degree Project. This detailed analysis allowed a better understanding of the academic distribution of the student body; a minimal sketch of the segmentation step is shown below.
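The following sketch illustrates the segmentation step in general terms only: the abstract names cluster analysis and the elbow method but not a specific algorithm, so a k-means-style analysis with scikit-learn is assumed, and the synthetic grade matrix is a placeholder for the faculty's academic records.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Placeholder data: one row per student, one column per course grade
# (stands in for the 2012-2021 records of the four degree programs).
rng = np.random.default_rng(0)
grades = rng.uniform(1.0, 5.0, size=(300, 12))
X = StandardScaler().fit_transform(grades)

# Elbow method: compute the within-cluster sum of squares (inertia) for a
# range of k and pick the value where the curve flattens (k = 4 in this study).
for k in range(2, 10):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))

# Segment the population into the four groups later characterized as
# graduates, early dropouts, late dropouts, and thesis stage.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)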
Predictive machine learning models were then tuned and evaluated using four different data configurations in a series of training and prediction experiments. The most effective configuration was the third experiment, which combined data from students in academic statuses 2 and 5 using course records up to the third year. This combination produced a more homogeneous data set, more representative of academic success, allowing the models to identify more accurately the patterns and key factors that predetermine successful academic outcomes.

With the third experiment selected, the optimal model varied by degree program. For Computer Engineering, the best model was K-Nearest Neighbors (KNN), with accuracy, precision, and recall of 0.896 and an F1-score of 0.895, while the lowest-performing model was the Decision Tree (DT), with accuracy and recall of 0.793, precision of 0.853, and F1-score of 0.814. For Electrical and Civil Engineering, the Decision Tree was the most effective: in Electrical Engineering it reached accuracy and recall of 0.980, precision of 0.981, and F1-score of 0.979, and in Civil Engineering accuracy and recall of 0.968, precision of 0.976, and F1-score of 0.970; the lowest-performing model in both programs was KNN, with accuracy and recall of 0.823, precision of 0.805, and F1-score of 0.814 in Electrical Engineering, and accuracy, recall, and F1-score of 0.843 with precision of 0.850 in Civil Engineering. For Electronics Engineering, Logistic Regression (LR) and K-Nearest Neighbors showed the best performance, with accuracy and recall of 0.888, precision of 0.898, and F1-score of 0.882, whereas the Decision Tree showed lower accuracy and recall of 0.777 with a somewhat better precision of 0.809, and the Random Forest (RF) and Support Vector Machine (SVM) models showed a lower precision of 0.740.

The conclusions highlight the performance of the different models in the early identification of at-risk students and propose integrating socioeconomic and psychological factors in future research in this field.
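For reference, a minimal sketch of how the three classifiers named above can be fitted and compared with the four reported metrics, assuming scikit-learn implementations and synthetic placeholder data rather than the faculty's actual third-year grade features and status labels:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder features (grades up to the third year) and a binary academic status label.
X, y = make_classification(n_samples=500, n_features=15, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and report the metrics used in the study.
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          round(accuracy_score(y_test, y_pred), 3),
          round(precision_score(y_test, y_pred), 3),
          round(recall_score(y_test, y_pred), 3),
          round(f1_score(y_test, y_pred), 3))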