Predicting higher education tuition fees using machine learning methods

Abstract

University education is a critical step in preparing young individuals for their future and shaping their careers. However, this educational service often entails high costs, requiring students and their families to bear significant financial burdens. Despite the growing importance of accurately estimating tuition fees, given their impact not only on families but also on university administration and national economies, there remains a noticeable gap in the literature regarding the application of advanced machine learning (ML) methods for tuition fee prediction. This study addresses this gap by employing and comparing various ML regression techniques, including Linear Regression, Lasso Regression, Random Forest, Decision Tree, Ridge Regression, XGBoost, and artificial neural networks (ANN), which have proven successful in related forecasting tasks but are underutilized in tuition fee estimation. After a rigorous data preprocessing phase on a comprehensive dataset, the empirical results demonstrate that XGBoost stands out as a highly effective model for predicting university tuition fees. The findings contribute to the literature by expanding the methodological toolkit for tuition fee estimation and provide valuable insights for students, university administrators, economists, and policymakers.

Similar Papers
  • Research Article
  • Cite Count: 23
  • 10.1038/s41598-024-56466-8
Machine learning study using 2020 SDHS data to determine poverty determinants in Somalia
  • Mar 12, 2024
  • Scientific Reports
  • Abdirizak A Hassan + 2 more

Extensive research has been conducted on poverty in developing countries using conventional regression analysis, which has limited prediction capability. This study aims to address this gap by applying advanced machine learning (ML) methods to predict poverty in Somalia. Utilizing data from the first-ever 2020 Somalia Demographic and Health Survey (SDHS), a cross-sectional study design is considered. ML methods, including random forest (RF), decision tree (DT), support vector machine (SVM), and logistic regression, are tested and applied using R software version 4.1.2, while conventional methods are analyzed using STATA version 17. Evaluation metrics, such as confusion matrix, accuracy, precision, sensitivity, specificity, recall, F1 score, and area under the receiver operating characteristic (AUROC), are employed to assess the performance of predictive models. The prevalence of poverty in Somalia is notable, with approximately seven out of ten Somalis living in poverty, making it one of the highest rates in the region. Among nomadic pastoralists, agro-pastoralists, and internally displaced persons (IDPs), the poverty average stands at 69%, while urban areas have a lower poverty rate of 60%. The accuracy of prediction ranged between 67.21% and 98.36% for the advanced ML methods, with the RF model demonstrating the best performance. The results reveal geographical region, household size, respondent age group, husband employment status, age of household head, and place of residence as the top six predictors of poverty in Somalia. The findings highlight the potential of ML methods to predict poverty and uncover hidden information that traditional statistical methods cannot detect, with the RF model identified as the best classifier for predicting poverty in Somalia.
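
The evaluation metrics named in this abstract can all be derived from the four confusion-matrix counts. The following is a minimal Python sketch of that derivation, not the authors' R code; the function name `binary_metrics` and the toy labels are hypothetical, and 1 is assumed to denote the positive class (here, "poor"). Note that sensitivity and recall are the same quantity.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # identical to recall
    specificity = safe_div(tn, tn + fp)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

AUROC is left out of this sketch because it requires predicted scores rather than hard class labels.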

  • Abstract
  • 10.1016/j.spinee.2021.05.333
P125. Development of a novel ensemble machine learning algorithm for prediction of complications and readmission after anterior cervical spinal fusion
  • Aug 10, 2021
  • The Spine Journal
  • Akash A Shah + 7 more

  • Abstract
  • 10.1016/j.spinee.2021.05.334
P126. Development of a novel ensemble machine learning algorithm for prediction of complications and readmission after posterior cervical spinal fusion
  • Aug 10, 2021
  • The Spine Journal
  • Akash A Shah + 7 more

  • Preprint Article
  • 10.5194/egusphere-egu25-3240
Optimal machine learning techniques for meteorological modeling of PM2.5 concentration in five major polluted cities of South-East Asia
  • Mar 18, 2025
  • Sedra Shafi + 1 more

Air quality across Southeast and Western Pacific Asia is declining at an accelerating pace due to population growth and industrial development. The region’s meteorological factors, including monsoon seasonality, exert a significant influence on air pollution levels, particularly PM2.5 concentrations. In this study, we employ a statistical modeling approach to derive daily PM2.5 levels from meteorological parameters in five major polluted cities: Lahore (Pakistan), Delhi (India), Dhaka (Bangladesh), Hanoi (Vietnam), and Shanghai (China). The incorporated meteorological parameters, covering 2020 to 2022, are wind speed, barometric pressure, temperature, and rainfall, which are known to affect air pollution levels. The statistical modeling was based on a comparative analysis of 35 different machine learning (ML) regression techniques, with the purpose of selecting the algorithms most efficient at reconstructing and predicting PM2.5 levels from meteorological variables alone. Specifically, each ML regression model was trained to reconstruct daily PM2.5 levels in 2020–2021, and then used both to reconstruct missing daily PM2.5 levels in 2020–2021 and to forecast the whole of 2022 using only the 2022 meteorological records. The results indicated that most of the daily and seasonal variability in daily PM2.5 levels could be reconstructed from meteorological conditions. However, the performance of the various ML models (as assessed by Root Mean Square Error tests) exhibited considerable variability. Among the tested models, the Ensembles Boosted Tree ML method demonstrated optimal efficiency during the training period (the first 2 years, 2020 and 2021), and it was also highly efficient in predicting the third year (2022) using only meteorological data.
Additionally, the Trilayer Neural Network ML method was found to be the most effective at reconstructing the data after 3 years of training and may therefore be preferred for filling in short periods of missing PM2.5 data. In contrast, our comparative analyses showed that traditional multi-linear regression models underperformed in both reconstructing and predicting PM2.5 data. This study demonstrates the necessity and usefulness of assessing multiple ML regression methodologies to select those that perform best at reconstructing the data of interest (in our case PM2.5 records) from their hypothesized constructors (in our case meteorological parameters). In particular, this study has highlighted the utility of ML regression techniques for forecasting air quality and reconstructing missing pollution data, which is crucial for policy-making across the South-East and Western-Pacific Asia regions, where only limited pollution monitoring infrastructure is available.

  • Research Article
  • Cite Count: 23
  • 10.1080/19475705.2023.2225691
Water depth estimation from Sentinel-2 imagery using advanced machine learning methods and explainable artificial intelligence
  • Jun 27, 2023
  • Geomatics, Natural Hazards and Risk
  • Vahideh Saeidi + 5 more

The estimation of water depth in coastal areas and shallow waters is crucial for marine management and monitoring. However, direct measurements using fieldwork methods can be costly and time-consuming. Therefore, remote sensing imagery is a promising source of geospatial information for coastal planning and development. To this end, this study investigates advanced machine learning (ML) methods and redesigned morphological profiles for water depth estimation using high-resolution Sentinel-2 satellite imagery. The proposed framework involves three main steps: (1) morphological feature generation, (2) model training using several ML methods (Decision Tree, Random Forest, eXtreme Gradient BOOSTing, Light Gradient Boosting Machine, Deep Neural Network, and CatBoost), and (3) model interpretation using eXplainable Artificial Intelligence (XAI). The performance of the proposed method was evaluated in two different coastal areas (port and jetty) with reference data from accurate hydrographic data (Echo-sounder and differential global positioning systems). The statistical analysis revealed that the proposed method had high efficiency for depth estimation of the coastal area, achieving a best R2 value of 0.96 and Root Mean Square Error (RMSE) of 0.27 m in water depth estimation in the shallow water of Chabahar Bay in the Oman Sea. Additionally, the higher impact and interaction of the morphological features were verified using XAI for water depth mapping.

  • Research Article
  • Cite Count: 79
  • 10.3390/s20185055
Modified Red Blue Vegetation Index for Chlorophyll Estimation and Yield Prediction of Maize from Visible Images Captured by UAV
  • Sep 5, 2020
  • Sensors (Basel, Switzerland)
  • Yahui Guo + 8 more

The vegetation index (VI) has been successfully used to monitor the growth and to predict the yield of agricultural crops. In this paper, a long-term observation was conducted for the yield prediction of maize using an unmanned aerial vehicle (UAV) and estimations of chlorophyll contents using SPAD-502. A new vegetation index termed as modified red blue VI (MRBVI) was developed to monitor the growth and to predict the yields of maize by establishing relationships between MRBVI- and SPAD-502-based chlorophyll contents. The coefficients of determination (R2s) were 0.462 and 0.570 in chlorophyll contents’ estimations and yield predictions using MRBVI, and the results were relatively better than the results from the seven other commonly used VI approaches. All VIs during the different growth stages of maize were calculated and compared with the measured values of chlorophyll contents directly, and the relative error (RE) of MRBVI is the lowest at 0.355. Further, machine learning (ML) methods such as the backpropagation neural network model (BP), support vector machine (SVM), random forest (RF), and extreme learning machine (ELM) were adopted for predicting the yields of maize. All VIs calculated for each image captured during important phenological stages of maize were set as independent variables and the corresponding yields of each plot were defined as dependent variables. The ML models used the leave one out method (LOO), where the root mean square errors (RMSEs) were 2.157, 1.099, 1.146, and 1.698 (g/hundred grain weight) for BP, SVM, RF, and ELM. The mean absolute errors (MAEs) were 1.739, 0.886, 0.925, and 1.356 (g/hundred grain weight) for BP, SVM, RF, and ELM, respectively. Thus, the SVM method performed better in predicting the yields of maize than the other ML methods. 
Therefore, it is strongly suggested that the MRBVI calculated from images acquired at different growth stages integrated with advanced ML methods should be used for agricultural- and ecological-related chlorophyll estimation and yield predictions.
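
The leave-one-out (LOO) evaluation described in this abstract can be sketched as follows. This is a minimal illustration only: it swaps in a one-feature least-squares line for the BP, SVM, RF, and ELM models the authors used, and the helper names (`fit_line`, `loo_errors`) and the data values are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def loo_errors(xs, ys):
    """Hold out each sample once, fit on the rest, and collect the residuals."""
    residuals = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        residuals.append(ys[i] - (slope * xs[i] + intercept))
    n = len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n
    return rmse, mae


# Hypothetical data: an exactly linear series and a noisy one.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
exact_rmse, exact_mae = loo_errors(xs, [1.0, 3.0, 5.0, 7.0, 9.0])
noisy_rmse, noisy_mae = loo_errors(xs, [1.2, 2.7, 5.4, 6.8, 9.5])
print(exact_rmse, exact_mae, noisy_rmse, noisy_mae)
```

Because every sample serves as the test set exactly once, LOO suits small datasets such as per-plot yields. RMSE is never smaller than MAE on the same residuals, which is consistent with the paired values reported in the abstract.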

  • Book Chapter
  • Cite Count: 3
  • 10.1007/978-981-16-8862-1_22
Remaining Useful Life Prediction Using Machine Learning Algorithms
  • Jan 1, 2022
  • Malcolm Andrew Madeira + 2 more

Machinery’s Remaining Useful Life (RUL) is an effective instrument for maintenance and performance. As a consequence, expenses are reduced, safety is enhanced, and operations are improved. A comparison of available Machine Learning (ML) methods to anticipate the RUL is presented in this research. The ML models were built and tested using datasets from NASA’s Prognostics Data Repository for turbo fan engine data. The obtained results were then compared to the real outcomes in order to assess the accuracy. To compare prediction accuracy, eleven ML methods were chosen. The various methods were evaluated in order to find the prediction model that best predicted the RUL in terms of number of cycles and also classified using binary classification and multi-class classification. These models are Linear Regression, LASSO Regression, Ridge Regression, Polynomial Regression, Decision Trees, Random Forest (RF), Logistic Regression, K-Nearest Neighbours (KNN), Support Vector Machines (SVM), Gaussian Naïve Bayes (NB) and neural net Multilayer Perceptron (MLP).
Keywords: Remaining useful life, Machine learning, PHM, Predictive maintenance, Data science

  • Research Article
  • Cite Count: 27
  • 10.3390/ijms21030713
Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology
  • Jan 22, 2020
  • International Journal of Molecular Sciences
  • Victor Tkachev + 5 more

(1) Background: Machine learning (ML) methods are rarely used for an omics-based prescription of cancer drugs, due to shortage of case histories with clinical outcome supplemented by high-throughput molecular data. This causes overtraining and high vulnerability of most ML methods. Recently, we proposed a hybrid global-local approach to ML termed floating window projective separator (FloWPS) that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features. (2) Methods: Here, we applied FloWPS to seven popular ML methods, including linear SVM, k nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naïve Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). (3) Results: We performed computational experiments for 21 high throughput gene expression datasets (41–235 samples per dataset) totally representing 1778 cancer patients with known responses on chemotherapy treatments. FloWPS essentially improved the classifier quality for all global ML methods (SVM, RF, BNB, ADA, MLP), where the area under the receiver-operator curve (ROC AUC) for the treatment response classifiers increased from 0.61–0.88 range to 0.70–0.94. We tested FloWPS-empowered methods for overtraining by interrogating the importance of different features for different ML methods in the same model datasets. (4) Conclusions: We showed that FloWPS increases the correlation of feature importance between the different ML methods, which indicates its robustness to overtraining. For all the datasets tested, the best performance of FloWPS data trimming was observed for the BNB method, which can be valuable for further building of ML classifiers in personalized oncology.

  • Research Article
  • Cite Count: 2
  • 10.1016/j.psj.2024.104489
An investigation of machine learning methods applied to genomic prediction in yellow-feathered broilers
  • Nov 1, 2024
  • Poultry Science
  • Bogong Liu + 6 more

  • Research Article
  • Cite Count: 11
  • 10.13031/trans.14305
Comparison of Machine Learning Methods for Leaf Nitrogen Estimation in Corn Using Multispectral UAV Images
  • Jan 1, 2021
  • Transactions of the ASABE
  • Razieh Barzin + 2 more

Highlights
  • Leaf nitrogen percentage in corn was estimated using various vegetation indices derived from UAVs.
  • Eight machine learning methods were compared to find the most accurate model for nitrogen estimation.
  • The most influential vegetation indices were determined for estimation of leaf nitrogen.

Abstract. Nitrogen (N) is the most critical component of healthy plants. It has a significant impact on photosynthesis and plant reproduction. Physicochemical characteristics of plants such as leaf N content can be estimated spatially and temporally because of the latest developments in multispectral sensing technology and machine learning (ML) methods. The objective of this study was to use spectral data for leaf N estimation in corn to compare different ML models and find the best-fitted model. Moreover, the performance of vegetation indices (VIs) and spectral wavelengths were compared individually and collectively to determine if combinations of VIs substantially improved the results as compared to the original spectral data. This study was conducted at a Mississippi State University corn field that was divided into 16 plots with four different N treatments (0, 90, 180, and 270 kg ha⁻¹). The bare soil pixels were removed from the multispectral images, and 26 VIs were calculated based on five spectral bands: blue, green, red, red-edge, and near-infrared (NIR). The 26 VIs and five spectral bands obtained from a red-edge multispectral sensor mounted on an unmanned aerial vehicle (UAV) were analyzed to develop ML models for leaf %N estimation of corn. The input variables used in these models had the most impact on chlorophyll and N content and high correlation with leaf N content. Eight ML algorithms (random forest, gradient boosting, support vector machine, multi-layer perceptron, ridge regression, lasso regression, and elastic net) were applied to three different categories of variables. 
The results show that gradient boosting and random forest were the best-fitted models to estimate leaf %N, with about an 80% coefficient of determination for the different categories of variables. Moreover, adding VIs to the spectral bands improved the results. The combination of SCCCI, NDRE, and red-edge had the largest coefficient of determination (R2) in comparison to the other categories of variables used to predict leaf %N content in corn. Keywords: Corn, Gradient boosting, Machine learning, Multispectral imagery, Nitrogen estimation, Random forest, UAV, Vegetation index.

  • Research Article
  • Cite Count: 2
  • 10.1007/s00270-024-03776-z
Machine Learning to Predict Prostate Artery Embolization Outcomes
  • Jun 19, 2024
  • CardioVascular and Interventional Radiology
  • G Vigneswaran + 8 more

Purpose: This study leverages pre-procedural data and machine learning (ML) techniques to predict outcomes at one year following prostate artery embolization (PAE).
Materials and Methods: This retrospective analysis combines data from the UK-ROPE registry and patients that underwent PAE at our institution between 2012 and 2023. Traditional ML approaches, including linear regression, lasso regression, ridge regression, decision trees and random forests, were used with leave-one-out cross-validation to predict international prostate symptom score (IPSS) at baseline and change at 1 year. Predictors included age, prostate volume, Qmax (maximum urinary flow rate), post-void residual volume, Abrams-Griffiths number (urodynamics score) and baseline IPSS (for change at 1 year). We also independently confirmed our findings using a separate dataset. An interactive digital user interface was developed to facilitate real-time outcome prediction.
Results: Complete data were available in 128 patients (66.7 ± 6.9 years). All models predicting IPSS demonstrated reasonable performance, with mean absolute error ranging between 4.9–7.3 for baseline IPSS and 5.2–8.2 for change in IPSS. These numbers represent the differences between the patient-reported and model-predicted IPSS scores. Interestingly, the model error in predicting baseline IPSS (based on objective measures alone) significantly correlated with the change in IPSS at 1-year post-PAE (R² = 0.2, p < 0.001), forming the basis for our digital user interface.
Conclusion: This study uses ML methods to predict IPSS improvement at 1 year, integrated into a user-friendly interface for real-time prediction. This tool could be used to counsel patients prior to treatment.

  • Book Chapter
  • 10.1007/978-3-030-35210-3_5
Flexible Data Trimming for Different Machine Learning Methods in Omics-Based Personalized Oncology
  • Jan 1, 2019
  • Victor Tkachev + 2 more

Machine learning (ML) methods are still rarely used for gene expression/mutation-based prediction of individual tumor responses to anticancer chemotherapy, because clinical case histories supplemented with high-throughput molecular data are relatively rare. This leaves most ML methods highly vulnerable to overtraining. Recently, we proposed a novel hybrid global-local approach to ML termed FLOating Window Projective Separator (FloWPS) that avoids extrapolation in the feature space and may improve robustness of classifiers even for datasets with a limited number of preceding cases. FloWPS has been validated for the support vector machines (SVM) method, where it significantly improved the quality of classifiers. The core property of FloWPS is data trimming, i.e. sample-specific removal of features. The irrelevant features in a sample that do not have a significant number of neighboring hits in the training dataset are removed from further analyses. In addition, for each point of a validation dataset, only the proximal points of the training dataset are taken into account. Thus, for every point of a validation dataset, the training dataset is adjusted to form a floating window. Here, we applied this approach to seven popular ML methods, including SVM, k nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naive Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). We performed computational experiments for 21 high-throughput clinically annotated gene expression datasets, in total including 1778 cancer patients who either responded or did not respond to chemotherapy treatments. The largest dataset contained samples for 235 individual cases, the smallest for 41. For global ML methods, such as SVM, RF, BNB, ADA and MLP, FloWPS essentially improved the classifier quality. 
Namely, the area under the receiver-operator curve (ROC AUC) for the responder vs non-responder classifier increased from a typical range of 0.65–0.85 to 0.80–0.95. On the other hand, FloWPS was shown to bring no benefit for purely local ML techniques such as the kNN method or RR. However, both these local methods exhibited low sensitivity or specificity in cases when false positive or false negative errors, respectively, should be avoided. According to the sensitivity-specificity criterion, for all the datasets tested, the best performance in combination with FloWPS data trimming was shown for the binomial naive Bayesian method, which can be valuable for further development of predictors in personalized oncology.

  • Research Article
  • Cite Count: 25
  • 10.1007/s40745-023-00464-6
Count Regression and Machine Learning Techniques for Zero-Inflated Overdispersed Count Data: Application to Ecological Data
  • Apr 13, 2023
  • Annals of Data Science
  • Bonelwa Sidumo + 2 more

The aim of this study is to investigate the overdispersion problem that is rampant in ecological count data. In order to explore this problem, we consider the most commonly used count regression models: the Poisson, the negative binomial, the zero-inflated Poisson and the zero-inflated negative binomial models. The performance of these count regression models is compared with the four proposed machine learning (ML) regression techniques: random forests, support vector machines, k-nearest neighbors and artificial neural networks. The mean absolute error was used to compare the performance of count regression models and ML regression models. The results suggest that ML regression models perform better compared to count regression models. The performance shown by ML regression techniques is a motivation for further research in improving methods and applications in ecological studies.

  • Research Article
  • Cite Count: 7
  • 10.2139/ssrn.3251077
Developing Theory Using Machine Learning Methods
  • Jan 1, 2018
  • SSRN Electronic Journal
  • Prithwiraj Choudhury + 2 more

We describe how to employ machine learning (ML) methods in theory development. Compared to traditional causal inference methods, ML methods make far fewer a priori assumptions about the functional form of the underlying model that best represents the data. Given this, researchers could use such methods to explore novel and robust patterns in the data that could lead to inductive theory building. ML strengths include replicable identification of novel patterns in the data. Additionally, ML methods address several concerns (such as ‘p-hacking’ and confounding local effects for global effects) raised by scholars relative to the norms of empirical research in the fields of strategy and management. We develop a step-by-step roadmap that illustrates how to use four ML methods (decision trees, random forests, K-nearest neighbors and neural networks) to reveal patterns in data that could be used for theory building. We also illustrate how ML methods could better illuminate interactions and non-linear effects, relative to traditional methods. In summary, ML methods could act as a complementary tool to both existing inductive theory-creating methods such as multiple case inductive studies and traditional methods of causal inference.

  • Conference Article
  • Cite Count: 6
  • 10.2118/201379-ms
A Machine Learning Approach to Reduce the Number of Simulations for Long-Term Well Control Optimization
  • Oct 19, 2020
  • Daniel Rodrigues Santos + 3 more

A long-term well control strategy is frequently selected using optimization methods applied to reservoir simulations. However, this approach usually requires a large number of simulations that can be computationally demanding. In this paper, we evaluated several machine learning (ML) techniques to reduce the number of simulations for optimizing long-term well control strategy while preserving the quality of the solution. We proposed a methodology, denoted as IDLHC–ML, which combines many ML techniques with iterative discrete Latin hypercube (IDLHC) – a gradient-free optimization algorithm that was successfully applied in previous work – to optimize the coefficients of the logistic equation that guides the well's bottom-hole pressure along the time horizon. In IDLHC-ML, we used a set of simulation runs from the first iteration to train the initial ML models. From the second iteration onwards, we employed the trained ML models to predict the net present value (NPV) and only a percentage of the scenarios, which were expected to have the best NPV, were then simulated. As we simulated new scenarios, we updated our ML models to further improve predictions. For a fair comparison, we set the same values for the optimization parameters of IDLHC to the IDLHC–ML and, then, we compared the NPV and the number of simulation runs considering different configurations of IDLHC parameters. In this paper, we evaluated a total of twelve ML regression techniques, such as Bayesian Ridge, Random Forest, and stacked ensemble learning, which consists in using the predictions from multiple ML algorithms as input to a second-level learning model. To minimize random effects, we repeatedly applied IDLHC and IDLHC–ML five times in a single reservoir model (nominal optimization). 
The results showed that, depending on the IDLHC optimization parameters, IDLHC–ML reduced the number of simulations by at least 27% while maintaining equivalent NPV statistical metrics, calculated over all five repetitions, when compared to IDLHC. Moreover, the best ML technique for IDLHC–ML varied with the IDLHC set of optimization parameters. To conclude, the method proposed here was able to save a significant amount of computational time by curtailing the total number of expensive full-physics reservoir simulations, with the help of fast and low-cost ML models. There are many published studies on well control optimization, but these generally involve high computational demand. In this sense, ML methods proved to be an adequate and inexpensive alternative for reducing the number of simulation runs in well control optimization. The methodology is generic and can be applied under uncertainty and to more complex cases.
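
The screening idea in this abstract (predict NPV with a cheap model, then run the full simulation only for the most promising scenarios) can be sketched as below. This is a hedged illustration, not the IDLHC–ML algorithm itself: the function `screen_and_simulate`, the toy `surrogate` and `simulate` functions, and the 50% keep fraction are all assumptions, and the retraining of the ML models after each iteration is omitted.

```python
import math


def screen_and_simulate(candidates, surrogate, simulate, keep_fraction):
    """Rank candidates by surrogate-predicted value; simulate only the top share."""
    n_keep = max(1, math.ceil(keep_fraction * len(candidates)))
    ranked = sorted(candidates, key=surrogate, reverse=True)
    return [(c, simulate(c)) for c in ranked[:n_keep]]


simulation_calls = []


def simulate(x):
    """Stand-in for an expensive reservoir simulation (toy NPV peaking at x = 6)."""
    simulation_calls.append(x)
    return -(x - 6) ** 2


def surrogate(x):
    """Stand-in for a trained ML model: cheap and slightly wrong (peak at x = 5)."""
    return -(x - 5) ** 2


candidates = list(range(10))
results = screen_and_simulate(candidates, surrogate, simulate, keep_fraction=0.5)
best_candidate, best_npv = max(results, key=lambda pair: pair[1])
print(len(simulation_calls), best_candidate, best_npv)
```

In this toy setup the surrogate is imperfect, yet the true optimum still falls inside the screened subset, so half of the simulations are avoided. Whether that holds in practice depends on surrogate quality, which is why the paper updates its models as new simulations arrive.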
