Ensemble Regression Models Applied to Dropout in Higher Education

Abstract

Context: School dropout is a significant challenge for the Brazilian education system. Several factors need to be corrected, and others eliminated, so that students can have access to higher education and complete their courses. Motivation: Finding the best model to predict a specific problem is not a simple task, because the phenomena involved are either not well understood or difficult to model. Combining models therefore often produces better accuracy than individual models. Several approaches use this kind of combination and have been applied in Data Mining (DM) for prediction and classification. Objective: In this study we propose three different models, based on ensemble regression, to predict school dropout. We apply the models in the context of Brazilian higher education institutions. The models may also help identify the factors associated with dropout. For this purpose, we used two attribute selection techniques: stepwise selection and Pearson correlation. These techniques determine the factors related to dropout. Methodology: We used data from the Census and Flow Indicators of Higher Education. The methodology is based on CRISP-DM to understand, prepare, and model the data. We used bagging methods to build a model that predicts dropout. Results: The proposed ensemble regression models performed better than the models from the literature. The ensemble model based on bagging of linear regression had a smaller prediction error. The models proposed in this study can also help educational administrators and policymakers in the educational sector develop new policies relevant to student retention. The broader implication of this research for practice is its ability to help identify, at an early stage, the factors associated with students at risk of dropping out of higher education.
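
As a rough illustration of the approach described in the abstract, the following Python sketch filters attributes by Pearson correlation and then averages linear regressions fitted on bootstrap resamples (bagging). It is not the authors' implementation: the synthetic data, the 0.3 correlation threshold, the 25 resamples, and all function names are assumptions made for this example, and stepwise selection is not shown.

```python
import numpy as np

def pearson_select(X, y, threshold):
    """Indices of columns whose absolute Pearson correlation with y is >= threshold."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.flatnonzero(np.abs(r) >= threshold)

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def fit_bagged_linear(X, y, n_models=25, seed=0):
    """Fit one linear model per bootstrap resample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)
        models.append(fit_linear(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    """Bagged prediction: the mean of the individual model predictions."""
    return np.mean([predict_linear(c, X) for c in models], axis=0)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in data (hypothetical): 5 attributes, only the first two
# actually drive the target, which plays the role of a dropout rate.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.05, size=300)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

selected = pearson_select(X_train, y_train, threshold=0.3)
models = fit_bagged_linear(X_train[:, selected], y_train)
test_rmse = rmse(y_test, predict_bagged(models, X_test[:, selected]))
print("selected attributes:", selected.tolist())
print("held-out RMSE:", round(test_rmse, 4))
```

Averaging over bootstrap resamples mainly reduces the variance of the fitted coefficients; with a stable base learner such as linear regression the gain over a single fit is usually modest, so any comparison should be made on held-out data.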

Similar Papers
  • Research Article
  • Citations: 2
  • 10.1016/j.inoche.2023.111498
New combined approach for prediction of stability constants of metal–ligand complexes using thermodynamic radii of metal ions and ensembles of regression models
  • Sep 26, 2023
  • Inorganic Chemistry Communications
  • Vitaly Solov'Ev + 1 more

  • Research Article
  • 10.5815/ijitcs.2025.03.07
Forecasting Agriculture Commodity Price Trend using Novel Competitive Ensemble Regression Model
  • Jun 8, 2025
  • International Journal of Information Technology and Computer Science
  • R Ragunath + 1 more

This paper introduces a novel approach for forecasting the price trends of agricultural commodities, addressing the price volatility faced by both farmers and consumers. Accurate forecasting of food prices is particularly crucial in emerging nations such as India, where food security is a top priority. To this end, the paper presents an ensemble learning-based approach for predicting the agricultural commodity price (ACP) trend. Using rainfall and wholesale pricing index (WPI) data, the study compares the performance of various individual and ensemble regression models. The findings show that the novel competitive ensemble regression (CER) approach outperforms traditional individual regression models in predicting price fluctuation trends. The approach has high potential to give farmers and dealers more precise predictions, which also makes the model suitable for the financial industry.

  • Research Article
  • Citations: 14
  • 10.1134/s0040601519030042
Model for Early Detection of Emergency Conditions in Power Plant Equipment Based on Machine Learning Methods
  • Mar 1, 2019
  • Thermal Engineering
  • A A Korshikova + 1 more

The article discusses a method for early detection and prediction of abnormality in operation of power-unit process equipment taking as an example the PTN 1100-350-17-4 turbine driven feedwater pump of a 300 MW power unit. The importance of the problem of predicting possible process equipment malfunctions at an early state of their occurrence is determined, and the specific features of solving it in the power industry are explained. The range of process equipment defects that can be efficiently detected using the predictive analytics methods is outlined. The fundamental assertion stating that the scope of analog and discrete measurements available in the process control system’s set of computerized automation tools is sufficient for applying the predictive analytics methods is emphasized. Modern predictive analytics methods are briefly reviewed, and the specific features of model training algorithms are mentioned. Separate attention is paid to the problems of preparing initial data for training the model. The mathematical problem of modeling an abnormality indicator taking the values from 0 (normal operation) to 1 (abnormal operation) is formulated. In turn, this problem is formulated as the binary classification problem of attribute vectors characterizing the equipment state at the given moment of time. An original approach is suggested, which combines the multivariate state estimation technique (MSET), in which the degree of abnormality in a technical state is determined from the extent to which the Hotelling criterion exceeds a threshold level (which is automatically calculated in the algorithm), and machine learning methods, the use of which makes it possible to overcome a number of difficulties inherent in the MSET. For solving the problem of determining the composition of the most informative attributes from the values of which early development of an emergency can be detected, it is proposed to use an ensemble of regression models. 
A method for selecting the modeled variable and the set of regressors is substantiated. An abnormality indicator calculation method based on composing an ensemble of linear regression models is proposed, and the advantage of using an ensemble over a single classifier is shown. A method for producing an alarm in response to detected abnormality in the operation of power unit process equipment is suggested. It is shown that it became possible by using the proposed model to detect the onset of the emergency development process, whereas individual indicators failed to reveal pump operation singularities in the preemergency interval of time.

  • Research Article
  • Citations: 1
  • 10.1371/journal.pone.0328213
Ensemble machine learning prediction model for clinical refraction using partial interferometry measurements in childhood.
  • Jul 10, 2025
  • PloS one
  • Sa Ra Kim + 3 more

To develop an ensemble machine learning prediction model for clinical refraction in childhood using partial interferometry measurements. Age, sex, cycloplegic refraction, and partial interferometry data collected within one month were obtained from patients aged 5-16 years, retrospectively. Four ensemble regression models were used to develop prediction models of spherical equivalents (SE) from the collected data. Root mean squared error (RMSE) was used to compare the accuracy among the models. The accuracy of the ensemble models was compared with that of a previously developed multiple linear regression model. 4156 eyes from 1965 patients (50.3% female) were included. Mean age was 8.4 ± 2.3 years and mean SE was -1.01 ± 2.94 diopters. Mean axial length was 23.63 ± 1.41 mm and mean keratometry reading of flat and steep axis was 43.58 ± 1.40 diopters. Developed ensemble models had accuracy of RMSE 0.800 to 0.829 diopters, which was superior to that of the conventional regression model (1.213 diopters). Simulations with the same biometric parameters showed that female sex was associated more with myopia than that of male sex. Long eyes showed dampened increase in the myopic refraction per unit axial length. Refractive errors can be calculated in the childhood using these ensemble models with ocular biometric parameters. Moreover, the models were able to simulate hypothetical relationships between ocular parameters and SE to understand the nature of clinical refraction.

  • Research Article
  • Citations: 52
  • 10.1016/j.eswa.2015.07.022
Intelligent affect regression for bodily expressions using hybrid particle swarm optimization and adaptive ensembles
  • Jul 21, 2015
  • Expert Systems with Applications
  • Yang Zhang + 4 more

  • Research Article
  • Citations: 2
  • 10.3724/sp.j.1123.2020.06011
Ensemble hologram quantitative structure activity relationship model of the chromatographic retention index of aldehydes and ketones
  • Mar 1, 2021
  • Se pu = Chinese journal of chromatography
  • Bin Lei + 6 more

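The chromatographic retention index (RI) is an important parameter in chromatographic analysis, and different compounds show different retention behavior on stationary phases of different polarity. Aldehydes and ketones are numerous, and determining their RI values experimentally is costly in both time and money. This paper uses ensemble modeling combined with the hologram quantitative structure-activity relationship (HQSAR) method to study quantitative structure-activity relationship (QSAR) models for the retention indices of aldehydes and ketones on two stationary phases (DB-210 and HP-Innowax). The predictive ability of the models was assessed by external test set validation and leave-one-out cross-validation. First, individual HQSAR models were built for the 34 compounds studied. On the DB-210 stationary phase, the best individual model was obtained with the fragment distinction (FD) "donor/acceptor atoms (DA)" and a fragment size (FS) of 1 to 9; on the HP-Innowax stationary phase, the best individual model was obtained with the FD "DA" and an FS of 4 to 7. For these two models, the cross-validated correlation coefficients were 0.935 and 0.909, the external validation correlation coefficients were 0.925 and 0.927, the concordance correlation coefficients (CCC) were 0.953 and 0.960, the predictive squared correlation coefficients F2 were 0.922 and 0.918, and the predictive squared correlation coefficients F3 were 0.931 and 0.927, respectively. The results show that a quantitative relationship exists between the molecular structure of aldehydes and ketones and their RI values, and that the HQSAR method can be used to build a QSAR model relating the two. Second, an ensemble HQSAR model was built by taking the arithmetic mean of the four individual HQSAR models with the highest prediction accuracy, used as sub-models. For the RI values of the studied compounds on the DB-210 and HP-Innowax stationary phases, the ensemble HQSAR model gave, in the same order of metrics, cross-validated correlation coefficients of 0.927 and 0.919, external validation correlation coefficients of 0.929 and 0.963, CCC values of 0.956 and 0.979, F2 values of 0.927 and 0.958, and F3 values of 0.935 and 0.963, respectively. Compared with the individual HQSAR models, the ensemble HQSAR model has higher prediction accuracy. This indicates that ensemble modeling is an effective way to improve the predictive ability of HQSAR models, and that HQSAR combined with ensemble modeling can be used to study and predict the RI values of aldehydes and ketones.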

  • Research Article
  • Citations: 44
  • 10.3390/app12073605
Prediction of Self-Healing of Engineered Cementitious Composite Using Machine Learning Approaches
  • Apr 1, 2022
  • Applied Sciences
  • Guangwei Chen + 4 more

Engineered cementitious composite (ECC) is a unique material, which can significantly contribute to self-healing based on ongoing hydration. However, it is difficult to model and predict the self-healing performance of ECC. Although different machine learning (ML) algorithms have been utilized to predict several properties of concrete, the application of ML on self-healing prediction is considerably rare. This paper aims to provide a comparative analysis on the performance of various machine learning models in predicting the self-healing capability of ECC. These models include four individual methods, linear regression (LR), back-propagation neural network (BPNN), classification and regression tree (CART), and support vector regression (SVR). To improve prediction accuracy, three ensemble methods, namely bagging, AdaBoost, and stacking, were also studied. A series of experimental works on the self-healing performance of ECC samples was conducted, and the results were used to develop and compare the accuracy among the ML models. The comparison results showed that the Stack_LR model had the best predictive performance, showing the highest coefficient of determination (R2), the lowest root-mean-squared error (RMSE), and the smallest prediction error (MAE). Among all individual models studies, the BPNN model performed the best in terms of the RMSE and R2, while SVR performed the best in terms of the MAE. Furthermore, SVR had the smallest prediction error (MAE) for crack widths less than 60 μm or greater than 100 μm, while CART had the smallest prediction error (MAE) for crack widths between 60 μm and 100 μm. The study concluded that the individual and ensemble methods can be used to predict the self-healing of ECC. Ensemble models were able to improve the accuracy of prediction compared to the individual model used as their base learner, i.e., a 2.3% to 4.9% reduction in MAE. However, selecting an appropriate individual and ensemble method is critical. 
To improve the performance accuracy, researchers should employ different ensemble methods to compare their effectiveness with different ML models.

  • Research Article
  • Citations: 18
  • 10.1007/s10278-019-00289-x
Performance Comparison of Individual and Ensemble CNN Models for the Classification of Brain 18F-FDG-PET Scans.
  • Oct 28, 2019
  • Journal of Digital Imaging
  • Tomomi Nobashi + 7 more

Two hundred eighty-nine brain FDG-PET scans (N; n = 150, A; n = 139) resulting in a total of 68,260 images were included. Nine individual 2D-CNN models with three different window settings for axial, coronal, and sagittal axes were trained and validated. The performance of these individual and ensemble models was evaluated and compared using a test dataset. Odds ratio, Akaike's information criterion (AIC), and area under curve (AUC) on receiver-operative-characteristic curve, accuracy, and standard deviation (SD) were calculated. An optimal window setting to classify normal and abnormal scans was different for each axis of the individual models. An ensembled model using different axes with an optimized window setting (window-triad) showed better performance than ensembled models using the same axis and different windows settings (axis-triad). Increase in odds ratio and decrease in SD were observed in both axis-triad and window-triad models compared with individual models, whereas improvements of AUC and AIC were seen in window-triad models. An overall model averaging the probabilities of all individual models showed the best accuracy of 82.0%. Data ensemble using different window settings and axes was effective to improve 2D-CNN performance parameters for the classification of brain FDG-PET scans. If prospectively validated with a larger cohort of patients, similar models could provide decision support in a clinical setting.

  • Research Article
  • Citations: 10
  • 10.1002/cl2.66
PROTOCOL: Dropout Prevention and Intervention Programs: Effects on School Completion and Dropout Among School‐aged Children and Youth
  • Jan 1, 2010
  • Campbell Systematic Reviews
  • Sandra Jo Wilson + 4 more

With the expansion of regional and national economies into a global marketplace, education has even greater importance as a primary factor in allowing young adults to enter the workforce and advance economically, as well as to share in the social, health, and other benefits associated with education and productive careers. Dropping out of school before completing the normal course of secondary education greatly undermines these opportunities and is associated with adverse personal and social consequences. Dropout rates in the United States vary by calculation method, state, ethnic background, and socioeconomic status (Cataldi, Laird, & KewelRamani, 2009). Across all states, the percentage of freshmen who did not graduate from high school in four years ranges from 13.1% to 44.2% and averages 26.8%. The status dropout rate, which estimates the percentage of individuals in a certain age range who are not in high school and have not earned a diploma or credential, is slightly lower. In October 2007, the proportion of noninstitutionalized 18-24 year olds not in school without a diploma or certificate was 8.7%. Males are more likely to be dropouts than females (9.8% vs. 7.7%). Status dropout rates are much higher for racial/ethnic minorities (21.4% for Hispanics and 8.4% for Blacks vs. 5.3% for Whites). Event dropout rates illustrate single year dropout rates for high school students and show that students from low-income households drop out of high school more frequently than those from more advantaged backgrounds (8.8% for low-income vs. 3.5% for middle income and 0.9% for high income students). The National Dropout Prevention Center/Network reports that school dropouts in the United States earn an average of $9,245 a year less than those who complete high school, have unemployment rates almost 13 percentage points higher than high school graduates, are disproportionately represented in prison populations, are more likely to become teen parents, and more frequently live in poverty (2009). The consequences of school dropout are even worse for minority youth, further exacerbating the economic and structural disadvantage they often experience. School dropout has implications not only for the lives and opportunities of those who experience it, but also has enormous economic and social implications for society at large. For instance,

  • Peer Review Report
  • 10.5194/bg-2022-4-ac1
Reply on RC1
  • Apr 26, 2022
  • Z George Xue

This study presents a novel ensemble regression model for forecasts of the hypoxic area (HA) in the Louisiana–Texas (LaTex) shelf. The ensemble model combines a zero-inflated Poisson generalized linear model (GLM) and a quasi-Poisson generalized additive model (GAM) and considers predictors with hydrodynamic and biochemical features. Both models were trained and calibrated using the daily hindcast (2007–2020) by a three-dimensional coupled hydrodynamic–biogeochemical model embedded in the Regional Ocean Modeling System (ROMS). Compared to the ROMS hindcasts, the ensemble model yields a low root-mean-square error (RMSE) (3256 km²), a high R² (0.7721), and low mean absolute percentage biases for overall (29 %) and peak HA prediction (25 %). When compared to the shelf-wide cruise observations from 2012 to 2020, our ensemble model provides a more accurate summer HA forecast than any existing forecast models with a high R² (0.9200); a low RMSE (2005 km²); a low scatter index (15 %); and low mean absolute percentage biases for overall (18 %), fair-weather summer (15 %), and windy-summer (18 %) predictions. To test its robustness, the model is further applied to a global forecast model and produces HA prediction from 2012–2020 with the adjusted predictors from the HYbrid Coordinate Ocean Model (HYCOM). In addition, model sensitivity tests suggest an aggressive riverine nutrient reduction strategy (92 %) is needed to achieve the HA reduction goal of 5000 km².

  • Peer Review Report
  • 10.5194/bg-2022-4-rc1
Comment on bg-2022-4
  • Mar 28, 2022

In this study, a novel ensemble regression model was developed for hypoxic area (HA) forecast in the Louisiana–Texas (LaTex) Shelf. The ensemble model combines a zero-inflated Poisson generalized linear model (GLM) and a quasi-Poisson generalized additive model (GAM) and considers predictors with hydrodynamic and biochemical features. Both models were trained and calibrated using the daily hindcast (2007–2020) by a three-dimensional coupled hydrodynamic–biogeochemical model embedded in the Regional Ocean Modeling System (ROMS). A promising HA forecast is provided by the ensemble model with a low RMSE (3,204 km²), a high R² (0.8005), and a precise performance in capturing hypoxic area peaks in the summers. To test its robustness, the model was further applied to a global forecast model and produced HA predictions from 2019 to 2020 with the adjusted predictors from the HYbrid Coordinate Ocean Model (HYCOM). Predicted HA shows a high agreement with the ROMS hindcast time series (RMSE = 4,571 km², R² = 0.8178). Our model can also predict the magnitude and onsets of summer HA peaks in both 2019 and 2020 with high accuracy. To the best of our knowledge, this ensemble model is by far the first one providing fast and accurate daily HA predictions for the LaTex Shelf while considering both hydrodynamic and biochemical effects. This study demonstrates that it is feasible to perform regional ocean HA prediction using global ocean forecast.

  • Research Article
  • 10.22075/ijnaa.2021.4796
Predicting the number of comments on facebook posts using an ensemble regression model
  • Jan 1, 2021
  • International Journal of Nonlinear Analysis and Applications
  • Omid Rahmani Seryasat + 3 more

The nature and importance of users' comments in social media systems play an important role in creating or changing people's perceptions of certain topics or in popularizing them. Such comments now have an important place in various fields, including education, sales, and prediction. In this paper, the Facebook social network is considered as a case study. The purpose of this study is to predict the volume of Facebook users' comments on published content, called a post. The problem is therefore treated as a regression problem. In the method presented in this paper, three regression models, namely elastic net, the M5P model, and a radial basis function regression model, are combined into an ensemble model to predict the volume of comments. To combine these base models, a strategy called stacked generalization is used, in which the outputs of the base models are provided to a linear regression model as new features. This linear regression model combines the outputs of the three base models and determines the final output of the system. To evaluate the performance of the proposed model, a dataset from the UCI repository, which has 5 training sets and 10 test sets, was used. Each test set has 100 records. In the present study, the efficiency of the base models and the proposed ensemble model is evaluated on all these sets. Finally, it is concluded that the use of the ensemble model can reduce the average correlation coefficient (as one of the evaluation criteria of the model) to 74.4 ± 16.4, which is an acceptable result.
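
The stacking strategy described in this abstract, where base-model outputs become the input features of a linear regression, can be sketched in Python as follows. This is only an illustration under stated assumptions: ordinary least squares, ridge regression, and a k-nearest-neighbour regressor stand in for the paper's elastic net, M5P, and radial basis function models, the data are synthetic, and every function name is hypothetical.

```python
import numpy as np

def add_bias(X):
    return np.column_stack([np.ones(len(X)), X])

def make_ridge(alpha):
    """Fitter for ridge regression (alpha=0 gives ordinary least squares)."""
    def fit(X, y):
        A = add_bias(X)
        penalty = alpha * np.eye(A.shape[1])
        penalty[0, 0] = 0.0  # leave the intercept unpenalized
        w = np.linalg.solve(A.T @ A + penalty, A.T @ y)
        return lambda X_new: add_bias(X_new) @ w
    return fit

def make_knn(k):
    """Fitter for a k-nearest-neighbour regressor (mean target of the k closest rows)."""
    def fit(X, y):
        def predict(X_new):
            d = ((X_new[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
            nearest = np.argsort(d, axis=1)[:, :k]
            return y[nearest].mean(axis=1)
        return predict
    return fit

def fit_stack(base_fitters, X_base, y_base, X_meta, y_meta):
    """Train base models on one split, then fit a linear meta-model on their
    predictions for a second split, so the meta-model sees out-of-sample outputs."""
    base_models = [fit(X_base, y_base) for fit in base_fitters]
    Z = np.column_stack([m(X_meta) for m in base_models])
    meta_w, *_ = np.linalg.lstsq(add_bias(Z), y_meta, rcond=None)
    def predict(X_new):
        Z_new = np.column_stack([m(X_new) for m in base_models])
        return add_bias(Z_new) @ meta_w
    return predict

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in data (hypothetical), split into base-training,
# meta-training, and test parts.
rng = np.random.default_rng(7)
X = rng.uniform(-3.0, 3.0, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)
X_base, y_base = X[:150], y[:150]
X_meta, y_meta = X[150:300], y[150:300]
X_test, y_test = X[300:], y[300:]

stack = fit_stack([make_ridge(0.0), make_ridge(10.0), make_knn(5)],
                  X_base, y_base, X_meta, y_meta)
pred = stack(X_test)
stack_rmse = rmse(y_test, pred)
baseline_rmse = rmse(y_test, np.full_like(y_test, y_base.mean()))
print("stacked RMSE:", round(stack_rmse, 4), "mean-only RMSE:", round(baseline_rmse, 4))
```

Fitting the meta-model on a split the base models never saw is one common way to keep it from simply rewarding whichever base model overfits most; cross-validated out-of-fold predictions are a frequent alternative.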

  • Peer Review Report
  • 10.5194/bg-2022-4-rc3
Comment on bg-2022-4
  • Apr 12, 2022

This study presents a novel ensemble regression model for forecasts of the hypoxic area (HA) in the Louisiana–Texas (LaTex) shelf. The ensemble model combines a zero-inflated Poisson generalized linear model (GLM) and a quasi-Poisson generalized additive model (GAM) and considers predictors with hydrodynamic and biochemical features. Both models were trained and calibrated using the daily hindcast (2007–2020) by a three-dimensional coupled hydrodynamic–biogeochemical model embedded in the Regional Ocean Modeling System (ROMS). Compared to the ROMS hindcasts, the ensemble model yields a low root-mean-square error (RMSE) (3256 km²), a high R² (0.7721), and low mean absolute percentage biases for overall (29 %) and peak HA prediction (25 %). When compared to the shelf-wide cruise observations from 2012 to 2020, our ensemble model provides a more accurate summer HA forecast than any existing forecast models with a high R² (0.9200); a low RMSE (2005 km²); a low scatter index (15 %); and low mean absolute percentage biases for overall (18 %), fair-weather summer (15 %), and windy-summer (18 %) predictions. To test its robustness, the model is further applied to a global forecast model and produces HA prediction from 2012–2020 with the adjusted predictors from the HYbrid Coordinate Ocean Model (HYCOM). In addition, model sensitivity tests suggest an aggressive riverine nutrient reduction strategy (92 %) is needed to achieve the HA reduction goal of 5000 km².

  • Peer Review Report
  • 10.5194/bg-2022-4-ac3
Reply on RC3
  • Apr 26, 2022
  • Z George Xue

In this study, a novel ensemble regression model was developed for hypoxic area (HA) forecast in the Louisiana–Texas (LaTex) Shelf. The ensemble model combines a zero-inflated Poisson generalized linear model (GLM) and a quasi-Poisson generalized additive model (GAM) and considers predictors with hydrodynamic and biochemical features. Both models were trained and calibrated using the daily hindcast (2007–2020) by a three-dimensional coupled hydrodynamic–biogeochemical model embedded in the Regional Ocean Modeling System (ROMS). A promising HA forecast is provided by the ensemble model with a low RMSE (3,204 km²), a high R² (0.8005), and a precise performance in capturing hypoxic area peaks in the summers. To test its robustness, the model was further applied to a global forecast model and produced HA predictions from 2019 to 2020 with the adjusted predictors from the HYbrid Coordinate Ocean Model (HYCOM). Predicted HA shows a high agreement with the ROMS hindcast time series (RMSE = 4,571 km², R² = 0.8178). Our model can also predict the magnitude and onsets of summer HA peaks in both 2019 and 2020 with high accuracy. To the best of our knowledge, this ensemble model is by far the first one providing fast and accurate daily HA predictions for the LaTex Shelf while considering both hydrodynamic and biochemical effects. This study demonstrates that it is feasible to perform regional ocean HA prediction using global ocean forecast.

  • Peer Review Report
  • 10.5194/bg-2022-4-ac2
Reply on RC2
  • Apr 26, 2022
  • Z George Xue

In this study, a novel ensemble regression model was developed for hypoxic area (HA) forecast in the Louisiana–Texas (LaTex) Shelf. The ensemble model combines a zero-inflated Poisson generalized linear model (GLM) and a quasi-Poisson generalized additive model (GAM) and considers predictors with hydrodynamic and biochemical features. Both models were trained and calibrated using the daily hindcast (2007–2020) by a three-dimensional coupled hydrodynamic–biogeochemical model embedded in the Regional Ocean Modeling System (ROMS). A promising HA forecast is provided by the ensemble model with a low RMSE (3,204 km²), a high R² (0.8005), and a precise performance in capturing hypoxic area peaks in the summers. To test its robustness, the model was further applied to a global forecast model and produced HA predictions from 2019 to 2020 with the adjusted predictors from the HYbrid Coordinate Ocean Model (HYCOM). Predicted HA shows a high agreement with the ROMS hindcast time series (RMSE = 4,571 km², R² = 0.8178). Our model can also predict the magnitude and onsets of summer HA peaks in both 2019 and 2020 with high accuracy. To the best of our knowledge, this ensemble model is by far the first one providing fast and accurate daily HA predictions for the LaTex Shelf while considering both hydrodynamic and biochemical effects. This study demonstrates that it is feasible to perform regional ocean HA prediction using global ocean forecast.
