Optimizing Student Performance Prediction: A Comparative Analysis of Regression Algorithms and Feature Selection Techniques on LMS Log Data
This study investigates the predictive power of learning management system (LMS) log data for student performance in higher education. Analyzing interactions from 114 students in a sports pedagogy course, we compared linear regression (LR), random forest regression (RFR), and support vector regression (SVR), each paired with mutual information (MI) and backward elimination (BE) feature selection. Results show that LMS log data alone can effectively predict final grades, with SVR combined with BE performing best ([Formula: see text], MAE = 4.54). Feature selection, particularly BE, consistently improved model performance across all algorithms. Key findings include: LMS interactions strongly predict academic performance; SVR outperforms the other algorithms in capturing complex relationships in educational data; and BE’s superiority highlights the importance of feature interactions. This research advances educational data mining (EDM) by identifying optimal modeling approaches for LMS data, contributing to the development of early warning systems in online and blended learning environments.
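The backward elimination (BE) step described in this abstract can be illustrated with a minimal sketch. The abstract does not state which criterion or estimator drove the elimination, so this version assumes a wrapper approach: start with all features, repeatedly drop the one whose removal most lowers validation MAE, and stop when no removal helps. Plain least squares stands in for the regressor, and the three synthetic columns are hypothetical placeholders for LMS activity counts.

```python
import random
from statistics import mean


def fit_ols(X, y):
    """Return least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for k in range(p):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]


def mae(y_true, y_pred):
    return mean(abs(a - b) for a, b in zip(y_true, y_pred))


def select(X, cols):
    return [[r[c] for c in cols] for r in X]


def holdout_mae(X_tr, y_tr, X_va, y_va, cols):
    """Fit on the training split using only `cols`, score MAE on the validation split."""
    coef = fit_ols(select(X_tr, cols), y_tr)
    return mae(y_va, predict(coef, select(X_va, cols)))


def backward_elimination(X_tr, y_tr, X_va, y_va):
    """Drop one feature at a time while doing so lowers validation MAE."""
    cols = list(range(len(X_tr[0])))
    best = holdout_mae(X_tr, y_tr, X_va, y_va, cols)
    while len(cols) > 1:
        trials = [(holdout_mae(X_tr, y_tr, X_va, y_va,
                               [c for c in cols if c != d]), d) for d in cols]
        score, drop = min(trials)
        if score >= best:  # no removal improves the score: stop
            break
        best = score
        cols.remove(drop)
    return cols, best


# Synthetic data (assumption): columns 0 and 1 drive the grade, column 2 is noise.
rng = random.Random(0)


def make_rows(n):
    X = [[rng.uniform(0, 10) for _ in range(3)] for _ in range(n)]
    y = [5 * a + 2 * b + rng.gauss(0, 1) for a, b, _ in X]
    return X, y


X_train, y_train = make_rows(60)
X_valid, y_valid = make_rows(30)
full_mae = holdout_mae(X_train, y_train, X_valid, y_valid, [0, 1, 2])
kept, best_mae = backward_elimination(X_train, y_train, X_valid, y_valid)
print("kept columns:", kept, "MAE:", round(best_mae, 3), "full-model MAE:", round(full_mae, 3))
```

Scoring on a single held-out split keeps the sketch short; with only 114 students, cross-validation would likely give a more stable elimination order.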
- Research Article
- 10.21015/vtse.v13i2.2111
- May 4, 2025
- VFAST Transactions on Software Engineering
With the advent of digitalization, education-related activities have started generating massive amounts of data from sources such as student interactions, assessments, and learning management systems. Such data are well suited to Educational Data Mining (EDM), which aims to reveal insights for actionable improvement in academic outcomes and personalized learning experiences. However, the high dimensionality and redundancy of educational data also pose considerable threats to the accuracy, interpretability, and computational efficiency of modeling. The Least Absolute Shrinkage and Selection Operator (LASSO) is a powerful technique for simultaneous regression and feature selection. LASSO penalizes the sum of the absolute values of the regression coefficients, which induces sparsity by shrinking the coefficients of insignificant features to exactly zero. This property is useful in EDM, where relevant indicators such as attendance, quiz scores, or study patterns must be distinguished from noisy or redundant variables. This paper systematically investigates the application of LASSO in EDM, presenting the mathematical background and geometric interpretation along with practical usage recommendations. LASSO performance is also evaluated on synthetic and real datasets, including the well-known UCI Student Performance dataset. The findings show that LASSO significantly enhances model interpretability and predictive accuracy while reducing model complexity. Finally, limitations are discussed, along with practical considerations and future directions for LASSO applications in next-generation educational analytics.
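To make the sparsity mechanism concrete, here is a minimal sketch of LASSO fitted by cyclic coordinate descent with soft-thresholding. It assumes the common objective (1/(2n))·‖y − Xβ‖² + λ‖β‖₁ with an unpenalized intercept; the abstract does not say which solver or scaling the paper uses, and the tiny dataset is invented for illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    # Center features and target so the intercept can be recovered afterwards.
    x_mean = [sum(r[j] for r in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[r[j] - x_mean[j] for j in range(p)] for r in X]
    resid = [t - y_mean for t in y]  # residual for beta = 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(r[j] ** 2 for r in Xc) / n
            if z == 0.0:  # constant column carries no information
                continue
            # Correlation of feature j with the residual that excludes feature j
            rho = sum(r[j] * (e + r[j] * beta[j]) for r, e in zip(Xc, resid)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            if delta != 0.0:
                resid = [e - r[j] * delta for r, e in zip(Xc, resid)]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


# Invented example: the target depends on the first column only.
X_demo = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y_demo = [2 * a + 1 for a, _ in X_demo]
for lam in (0.0, 0.5, 1e6):
    print(lam, lasso_cd(X_demo, y_demo, lam))
```

As λ grows, coefficients are first shrunk and then set to exactly zero, which is what lets LASSO double as a feature selector.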
- Research Article
1
- 10.32628/cseit2390641
- Mar 14, 2024
- International Journal of Scientific Research in Computer Science, Engineering and Information Technology
Feature selection is an important data pre-processing technique used to increase the performance of machine learning models, to build faster and more cost-effective algorithms, and to make model predictions easier to interpret. The main objective of this research work is to identify the features that most influence the prediction of particulate matter (PM10). This research uses 24-hour average pollutant concentration data from 36 air quality monitoring stations provided by Gandhinagar Smart City Development Limited (GSCDL), Gandhinagar, Gujarat. Important features were identified using five feature selection techniques (correlation, forward selection, backward elimination, Exhaustive Feature Selection (EFS), and feature importance derived using a Random Forest Regressor). With the selected features, six regression algorithms (Multiple Linear Regression, Random Forest, Decision Tree, K-nearest Neighbour, XGBoost, and Support Vector Regressor) were trained to predict PM10. The models were then compared on Root Mean Square Error (RMSE) and the coefficient of determination (R²) to identify the best-performing model. The proposed model can be utilized as an early warning system, providing air quality information to local authorities to develop air-quality improvement initiatives.
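The comparison metrics named in these abstracts (RMSE, R², and MAE) follow standard definitions, sketched below. The function names and sample numbers are illustrative and not taken from any of the listed papers.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more than MAE does."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when the target is constant")
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Illustrative values only
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 73.0, 79.0]
print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
```

R² can be negative on held-out data when a model predicts worse than the mean of the observations, so it is worth reporting alongside an error metric in the target's own units.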
- Research Article
70
- 10.5194/bg-14-5551-2017
- Dec 8, 2017
- Biogeosciences
Abstract. The Southern Ocean accounts for 40 % of oceanic CO2 uptake, but the estimates are bound by large uncertainties due to a paucity of observations. Gap-filling empirical methods have been used to good effect to approximate pCO2 from satellite observable variables in other parts of the ocean, but many of these methods are not in agreement in the Southern Ocean. In this study we propose two additional methods that perform well in the Southern Ocean: support vector regression (SVR) and random forest regression (RFR). The methods are used to estimate ΔpCO2 in the Southern Ocean based on SOCAT v3, achieving similar trends to the SOM-FFN method by Landschützer et al. (2014). Results show that the SOM-FFN and RFR approaches have RMSEs of similar magnitude (14.84 and 16.45 µatm, where 1 atm = 101 325 Pa), whereas the SVR method has a larger RMSE (24.40 µatm). However, the larger errors for SVR and RFR are, in part, due to an increase in coastal observations from SOCAT v2 to v3, where the SOM-FFN method used v2 data. The success of both SOM-FFN and RFR depends on the ability to adapt to different modes of variability. The SOM-FFN achieves this by having independent regression models for each cluster, while this flexibility is intrinsic to the RFR method. Analyses of the estimates show that the SVR and RFR's respective sensitivity and robustness to outliers define the outcome significantly. Further analyses on the methods were performed by using a synthetic dataset to assess the following: which method (RFR or SVR) has the best performance? What is the effect of using time, latitude and longitude as proxy variables on ΔpCO2? What is the impact of the sampling bias in the SOCAT v3 dataset on the estimates? We find that while RFR is indeed better than SVR, the ensemble of the two methods outperforms either one, due to complementary strengths and weaknesses of the methods. 
Results also show that for the RFR and SVR implementations, it is better to include coordinates as proxy variables as RMSE scores are lowered and the phasing of the seasonal cycle is more accurate. Lastly, we show that there is only a weak bias due to undersampling. The synthetic data provide a useful framework to test methods in regions of sparse data coverage and show potential as a useful tool to evaluate methods in future studies.
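The finding that an ensemble of two methods can outperform either one can be illustrated with a small sketch. The abstract does not say how the RFR and SVR estimates were combined, so an unweighted mean is assumed here, and all numbers are invented to show the case where the two models err in opposite directions.

```python
from math import sqrt


def rmse(y_true, y_pred):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def ensemble_mean(*prediction_sets):
    """Average the predictions of several models point by point."""
    return [sum(vals) / len(vals) for vals in zip(*prediction_sets)]


# Invented values: model A tends to overestimate, model B to underestimate.
observed = [10.0, 12.0, 9.0, 14.0]
pred_a = [11.0, 13.5, 9.5, 15.0]
pred_b = [9.5, 11.0, 8.0, 13.5]
combined = ensemble_mean(pred_a, pred_b)

for name, pred in (("A", pred_a), ("B", pred_b), ("mean", combined)):
    print(name, round(rmse(observed, pred), 3))
```

Averaging helps here because the errors partly cancel. When two models share the same bias, the mean inherits it, so the gain depends on the models having complementary weaknesses.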
- Research Article
1
- 10.24113/ijosthe.v7i4.132
- Sep 25, 2020
In recent years, research on Educational Data Mining (EDM) has developed rapidly. However, most research focuses on data source issues and ignores the importance of data preprocessing and data mining algorithms. This paper studies EDM, with a special focus on educational big data mining algorithms. First, it analyzes the relevant elements of EDM and introduces big data technology based on the requirements of educational data applications. It then introduces the common educational big data mining algorithms and their applications, and finally discusses the development trend of educational big data mining algorithms.
- Research Article
4
- 10.3390/w17030434
- Feb 4, 2025
- Water
Ice-jam floods (IJFs) are a significant hydrological phenomenon in the upper reaches of the Heilongjiang River, posing substantial threats to public safety and property. This study employed various feature selection techniques, including the Pearson correlation coefficient (PCC), Grey Relational Analysis (GRA), mutual information (MI), and stepwise regression (SR), to identify key predictors of river ice break-up dates. Based on this, we constructed various machine learning models, including Extreme Gradient Boosting (XGBoost), Backpropagation Neural Network (BPNN), Random Forest (RF), and Support Vector Regression (SVR). The results indicate that the ice reserves in the Oupu to Heihe section have the most significant impact on the ice break-up date in the Heihe section. Additionally, the accumulated temperature during the break-up period and average temperature before river ice break-up are identified as features closely related to the river’s opening in all four feature selection methods. The choice of feature selection method notably impacts the performance of the machine learning models in predicting the river ice break-up dates. Among the models tested, XGBoost with PCC-based feature selection achieved the highest accuracy (RMSE = 2.074, MAE = 1.571, R2 = 0.784, NSE = 0.756, TSS = 0.950). This study provides a more accurate and effective method for predicting river ice break-up dates, offering a scientific basis for preventing and managing IJF disasters.
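Of the feature selection techniques listed in this abstract, the Pearson correlation coefficient (PCC) filter is the simplest to sketch: score each candidate predictor by the absolute value of its correlation with the target and keep the top-ranked ones. The cut-off rule is not given in the abstract, so a fixed `top_k` is assumed, and the feature names and values below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def rank_features(features, target, top_k):
    """Return the names of the top_k features by absolute correlation with target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical predictors of a break-up date (day of year)
features = {
    "ice_reserve": [1.0, 2.0, 3.0, 4.0],
    "snow_flag": [1.0, 0.0, 1.0, 0.0],
    "mean_temp": [4.0, 3.0, 2.0, 1.5],
}
breakup_day = [2.0, 4.0, 6.0, 8.0]
print(rank_features(features, breakup_day, top_k=2))
```

A PCC filter captures only linear association and scores each feature in isolation, which is one reason studies like this one compare it with mutual information and stepwise methods.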
- Research Article
11
- 10.12928/telkomnika.v18i3.14802
- Jun 1, 2020
- TELKOMNIKA (Telecommunication Computing Electronics and Control)
Supporting the goal of higher education to produce graduates who will become professional leaders is crucial. Most universities implement an intelligent information system (IIS) to support their vision and mission. One of the features of an IIS is student performance prediction. By implementing a data mining model in the IIS, this feature can precisely predict students’ grades for their enrolled subjects. Moreover, it can identify at-risk students and allow top educational management to take educative interventions to help them succeed academically. In this research, a multi-regression model was proposed to build a model for every student. In our model, learning management system (LMS) activity logs were computed. Testing on large datasets of students, courses, and activities indicates that these models could improve prediction accuracy by over 15%.
- Book Chapter
19
- 10.1007/978-3-319-02738-8_4
- Nov 7, 2013
Identifying students’ behavior in university is of great concern to higher education management (Kumar and Uma, Eur J Sci Res 34(4):526–534). This chapter proposes a new educational technology system for use in Knowledge Discovery Processes (KDP). We introduce the educational data mining (EDM) software and present the outcome of a test on university data to explore the factors that affect student success, based on student profiling. In our software system all the tasks involved in the KDP are realized together. The advantage of this approach is access to all the functionalities of the Structured Query Language (SQL) Server and the Analysis Services through a single developed software item, which is specific to the needs of a higher education institution. This model (Guruler et al., Comput Educ 55(1):247–254) aims to help educational organizations better understand KDPs, and provides a roadmap to follow while executing whole knowledge projects, which are nontrivial and involve multiple stages and possibly several iterations.
- Research Article
2
- 10.3389/conf.fnhum.2019.229.00022
- Jan 1, 2019
- Frontiers in Human Neuroscience
- Conference Article
19
- 10.1109/csei50228.2020.9142529
- Jun 1, 2020
In recent years, research on Educational Data Mining (EDM) has developed rapidly. However, most research focuses on data source issues and ignores the importance of data preprocessing and data mining algorithms. This paper studies EDM, with a special focus on educational big data mining algorithms. First, it analyzes the relevant elements of EDM and introduces big data technology based on the requirements of educational data applications. It then introduces the common educational big data mining algorithms and their applications, and finally discusses the development trend of educational big data mining algorithms.
- Research Article
29
- 10.1080/13562517.2012.753049
- Jul 1, 2013
- Teaching in Higher Education
In this paper, our aim is to explore the predictors of student adoption of a Learning Management System (LMS) based on a Modular Object-Oriented Dynamic Learning Environment, as well as the influence of active student participation and interactive usage of an LMS on student achievement in a blended learning environment. Our study was conducted on 169 students from the largest university in Serbia who were using an LMS for the first time in their studies. Our findings indicate that students' active participation in class has a stronger positive effect on students' achievement than does students' interactive usage of the LMS. A stepwise linear regression analysis revealed that a student's interactive usage of the LMS and his/her active participation in class accounted for 47% of the variation in a student's achievement. A student's interactive usage of the LMS is affected only by his/her perceived ease of use of the LMS.
- Research Article
9
- 10.3389/fpls.2022.821365
- Feb 11, 2022
- Frontiers in Plant Science
Floods, as one of the most common disasters in the natural environment, have caused huge losses to human life and property. Predicting the flood resistance of poplar can effectively help researchers select seedlings scientifically and resist floods precisely. Using machine learning algorithms, models of poplar’s waterlogging tolerance were established and evaluated. First of all, the evaluation indexes of poplar’s waterlogging tolerance were analyzed and determined. Then, significance testing, correlation analysis, and three feature selection algorithms (Hierarchical clustering, Lasso, and Stepwise regression) were used to screen photosynthesis, chlorophyll fluorescence, and environmental parameters. Based on this, four machine learning methods, BP neural network regression (BPR), extreme learning machine regression (ELMR), support vector regression (SVR), and random forest regression (RFR) were used to predict the flood resistance of poplar. The results show that random forest regression (RFR) and support vector regression (SVR) have high precision. On the test set, the coefficient of determination (R2) is 0.8351 and 0.6864, the root mean square error (RMSE) is 0.2016 and 0.2780, and the mean absolute error (MAE) is 0.1782 and 0.2031, respectively. Therefore, random forest regression (RFR) and support vector regression (SVR) can be given priority to predict poplar flood resistance.
- Research Article
10
- 10.1016/j.biosystemseng.2021.11.021
- Dec 9, 2021
- Biosystems Engineering
Estimating the total nitrogen content of Aquilaria sinensis leaves based on a hybrid feature selection algorithm and image data from a modified digital camera
- Conference Article
4
- 10.1109/ictmod52902.2021.9739579
- Nov 24, 2021
This research study aims to evaluate the significance of technology and Industry 4.0 for student performance in higher education. Industry 4.0 is part of the digital revolution, which amalgamates technologies such as AI, distributed computing, virtual reality (VR), the Internet of Things (IoT), and Big Data to bring a fundamental transformation to current industry. The integration of these technologies has benefited all domains of society, including education. Education 4.0 aims to use Industry 4.0 technologies for the benefit of the education field, providing means to improve the education sector through techniques such as education mining and the prediction and prescription of students' performance during their time at university. This study highlights important literature in the areas of Industrial Revolution 4.0, Education 4.0, Big Data, machine learning, and descriptive, predictive, and prescriptive analysis, as well as learning analytics tools, to provide guidance for stakeholders in the education industry on enriching their processes to improve the performance of students at risk.
- Research Article
- 10.36348/sjbms.2025.v10i06.009
- Jul 31, 2025
- Saudi Journal of Business and Management Studies
Student enrolment, financial challenges, technology integration, and curriculum diversification have increased competition among higher education institutions. The ideal future workforce must possess not only technical expertise but also strong skills in complex problem solving, critical thinking, creativity, human resource management, and teamwork. In addition to analytical and leadership capabilities, these competencies are essential for thriving in a rapidly evolving digital economy. However, few studies have assessed Indonesia's readiness to engage with this digital transformation. The aim of this study is to examine the correlation of 4.0 educational adaptation and school management with student performance in higher education at Borobudur University. This study uses applied research with a cross-sectional design to examine the impact of technological infrastructure and faculty management on student performance. The population consists of employees at Borobudur University, with a sample of 40 respondents including leaders, lecturers, and education staff. The bivariate analysis found that technological infrastructure, strategic planning and policy making, operational management, and student assessment have a significant relationship with performance (p = 0.000). In the final model, technological infrastructure was significantly correlated with student assessment, with operational management as a confounding factor (R = 0.603, R² = 0.364 (36.4%), VIF = 2.955). Strategic planning and policy making was significantly correlated with student performance, with technological infrastructure as a confounding factor (R = 0.609, R² = 0.371 (37.1%), VIF = 2.277). Student assessment was significantly correlated with student performance (R = 0.460, R² = 0.211 (21.1%), p = 0.003, VIF = 1.000). 
In conclusion, technological infrastructure and operational management are correlated with student assessment; strategic planning and policy making and technological infrastructure are correlated with student performance; and student assessment is significantly correlated with student performance.
- Dissertation
- 10.53846/goediss-6040
- Feb 21, 2022
The thesis “Student Performance in Higher Education: Ability, Class Attendance, Mobility and the Bologna Process” empirically analyzes determinants of students’ success at university. Administrative student data as well as survey data collected at Göttingen University, Germany are used. Chapter 2 identifies individual and institutional factors, for example the high school leaving grade or the faculty a student is enrolled at, and analyzes their impact on academic performance. In this context, academic performance is measured in three dimensions: the probability of obtaining any degree at university, the probability of obtaining a degree within a chosen field of study and the grade of the final university degree. Two main results emerge: Firstly, the high school leaving grade is by far the most important individual determinant of students’ success at university. In contrast, criteria such as social origin or gender only play a minor role. Secondly, there are substantial differences between faculties implying that institutional factors also influence academic performance. Chapter 3 evaluates whether attending the lecture and/or tutorial in two basic courses in business administration and economics has an impact on the achieved grade. The analysis finds no significant effect of class attendance on university performance in most specifications. Although identifying a causal effect may not be possible with the data at hand, the result allows the conclusion that going to class and studying on one’s own may be substitutes in the given framework. Chapter 4 focuses on bachelor students to analyze whether a study-related visit abroad influences university outcomes. In this context, university outcomes are measured by the final grade of the bachelor degree and the probability of graduating within the standard time period. A propensity score matching strategy is applied to overcome the potential problem of self-selection into studying abroad. 
The analysis shows that a sojourn improves the final university grade. However, the result seems mainly to be driven by selective transferring of grades. In addition, bachelor students who do a study-related visit abroad have a lower probability of graduating within the standard time period than their peers who stay at the home institution. This supports the idea that students do not count all grades achieved abroad towards their degree at home. Finally, Chapter 5 is devoted to the Bologna process. It evaluates the effect of replacing traditional five-year degrees (Magister, Diplom, old teacher degree) with three-year bachelor programs on the duration until graduation and the timing of university drop-out. Competing risks models are estimated using a relative time measure that makes information on duration between old and new study programs comparable. The analysis shows that the Bologna process reduced the duration until students achieve their first university degree both in absolute and relative terms. However, concerning the timing of university drop-out, the results are less conclusive. Only for the faculty of humanities is there a clear effect of the Bologna process on the probability of dropping out of university.