Predicting seawater intrusion wedge length in coastal aquifers using hybrid gradient boosting techniques

Abstract

Controlling seawater intrusion (SWI) into freshwater aquifers is crucial for preserving water quality in coastal groundwater management. This research evaluates the performance of three Bayesian-optimized machine learning (ML) models, eXtreme Gradient Boosting (BO-XGB), Light Gradient Boosting Machine (BO-LGB), and Categorical Gradient Boosting (BO-CGB), in predicting the SWI wedge length. A database of 345 numerical simulations was compiled from previous research, and Bayesian Optimization (BO) with fivefold cross-validation was used to fine-tune the models. The inputs were abstraction well distance (Xa), abstraction well depth (Ya), recharge well distance (Xr), recharge well depth (Yr), abstraction rate (Qa), and artificial recharge rate (Qr); the output was the SWI wedge length (L). Results show that BO-CGB consistently achieved the best performance, with high R² values (0.996 in training and 0.969 in testing) and low RMSE values (0.439 m in training and 1.327 m in testing). SHapley Additive exPlanations (SHAP) analysis highlighted that Qa and Qr had the most significant impact on SWI wedge length predictions, followed by Xa and Ya. Partial Dependence Plot (PDP) analysis revealed a strong negative correlation between the flow variables Qa and Qr and wedge length, while Xr displayed a more complex, non-linear pattern. BO-CGB emerged as the most reliable model for predicting SWI wedge length. To facilitate practical application, an interactive Graphical User Interface (GUI) was developed, enabling users to input variables and receive instant predictions, enhancing the practical usability of the ML models in managing SWI in coastal aquifers.
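The two metrics reported above can be computed as in the following minimal sketch. It assumes the standard definitions of RMSE and R², since the abstract does not state the formulas. The function names and the sample wedge lengths are hypothetical and are not taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target (metres here)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical wedge lengths in metres, for illustration only
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 39.0]

print(rmse(observed, predicted), r2(observed, predicted))
```

Because RMSE carries the units of the target, the reported 1.327 m test error can be read directly against typical wedge lengths, whereas R² is dimensionless.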

Similar Papers
  • Research Article
  • Cited by 7
  • 10.1038/s41598-025-12830-w
Explainable ML modeling of saltwater intrusion control with underground barriers in coastal sloping aquifers.
  • Aug 10, 2025
  • Scientific Reports
  • Asaad M Armanuos + 2 more

Reliable modeling of saltwater intrusion (SWI) into freshwater aquifers is essential for the sustainable management of coastal groundwater resources and the protection of water quality. This study evaluates the performance of four Bayesian-optimized gradient boosting models in predicting the SWI wedge length ratio (L/La) in coastal sloping aquifers with underground barriers. A dataset of 456 samples was generated through numerical simulations using SEAWAT, incorporating key variables such as bed slope, hydraulic gradient, relative density, relative hydraulic conductivity, barrier wall depth ratio, and distance ratio. The dataset was divided into 70% for training and 30% for testing. Model performance was assessed using both visual and quantitative metrics. Among the models, Light Gradient Boosting (LGB) achieved the highest predictive accuracy, with RMSE values of 0.016 and 0.037 for the training and testing sets, respectively, and the highest coefficient of determination (R²). Stochastic Gradient Boosting (SGB) followed closely, while Categorical Gradient Boosting (CGB) and eXtreme Gradient Boosting (XGB) showed slightly higher error rates. SHapley Additive exPlanations (SHAP) analysis identified relative barrier wall distance and bed slope as the most influential features affecting model predictions. To support practical application, an interactive graphical user interface (GUI) was developed, allowing users to input key variables and easily estimate L/La values. Finally, the best-performing model was validated against the Akrotiri coastal aquifer in Cyprus, a realistic benchmark case derived from numerical simulations. The model's predictions showed strong agreement with reference results, achieving an RMSE of 0.04, thereby confirming its practical applicability. This study highlights the potential of interpretable, optimized ML models to enhance SWI prediction and support informed decision-making in coastal aquifer management.

  • Research Article
  • Cited by 16
  • 10.1007/s12145-025-01900-2
Estimating saltwater wedge length in sloping coastal aquifers using explainable machine learning models
  • May 20, 2025
  • Earth Science Informatics
  • Asaad M Armanuos + 1 more

Managing saltwater intrusion (SWI) in coastal aquifers is critical for safeguarding freshwater quality and ensuring sustainable water resources. This study evaluates the performance of eight machine learning (ML) models in predicting the SWI wedge length ratio (L/L₀) in sloping coastal aquifers. The assessed models encompassed linear, bagging, boosting, and advanced gradient boosting-based approaches, enabling a comprehensive comparison of their predictive capabilities. First, a numerical dataset of 450 samples was compiled, incorporating key dimensionless input variables such as relative density, hydraulic conductivity ratio, bed slope, and recharge well properties. The dataset was split into training and testing subsets in a 70:30 ratio, and model hyperparameters were optimized using Bayesian Optimization (BO). A thorough evaluation was conducted to identify the best-performing predictive model. Results showed that the Extreme Gradient Boosting (XGB) model demonstrated superior predictive accuracy compared to all other models, achieving low root-mean-square error (RMSE) values of 0.0216 during training and 0.0331 during testing, along with high R² scores of 0.9801 and 0.9586, respectively. The Categorical Gradient Boosting (CGB) model also exhibited strong performance, with RMSE values of 0.0271 (training) and 0.0316 (testing). SHapley Additive exPlanations (SHAP) analysis revealed that the relative recharge well rate was the most influential predictor, followed by recharge well distance and depth. To facilitate practical application, desktop and web-based graphical user interfaces (GUIs) were developed, allowing users to input variables and effortlessly predict L/L₀. This study demonstrates the effectiveness of ML models in predicting SWI in sloping coastal aquifers and provides user-friendly tools for engineers and researchers.

  • Research Article
  • Cited by 13
  • 10.1038/s41598-025-10990-3
Hydraulic Performance Modeling of Inclined Double Cutoff Walls Beneath Hydraulic Structures Using Optimized Ensemble Machine Learning.
  • Jul 29, 2025
  • Scientific Reports
  • Mohamed Kamel Elshaarawy + 2 more

This study investigates the effectiveness of inclined double cutoff walls installed beneath hydraulic structures by employing five machine learning models: Random Forest (RF), Adaptive Boosting (AdaBoost), eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost). A comprehensive dataset of 630 samples was gathered from previous studies, including key input variables such as the relative distance between the cutoff wall and the structure's apron width (L/B), the inclination angle ratio between downstream and upstream cutoffs (θ2/θ1), the depth ratio of downstream to upstream cutoff walls (d2/d1), and the relative downstream cutoff depth to the permeable layer depth (d2/D). Outputs considered were the relative uplift force (U/Uo), the relative exit hydraulic gradient (iR/iRo), and the relative seepage discharge per unit structure length (q/qo). The dataset was split with a 70:30 ratio for training and testing. Hyperparameter optimization was conducted using Bayesian Optimization (BO) coupled with five-fold cross-validation to enhance model performance. Results showed that the CatBoost model demonstrated superior performance over other models, consistently yielding high R² values, specifically surpassing 0.95, 0.93, and 0.97 for U/Uo, iR/iRo, and q/qo, respectively, along with low RMSE scores below 0.022, 0.089, and 0.019 for the same variables. A feature importance analysis was conducted using SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDPs). The analysis revealed that L/B was the most influential predictor for U/Uo and iR/iRo, while d2/D played a crucial role in determining q/qo. Moreover, PDPs illustrated a positive linear relationship between L/B and U/Uo, a V-shaped impact of d2/d1 on iR/iRo and q/qo, and complex nonlinear interactions for θ2/θ1 across all target variables. Furthermore, an interactive Graphical User Interface (GUI) was developed, enabling engineers to efficiently predict output variables and apply model insights in practical scenarios.
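The 70:30 split and five-fold cross-validation described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the helper name `k_fold_indices`, the shuffling seed, and the choice to cross-validate only the training portion are assumptions, and the Bayesian Optimization loop that would score each hyperparameter candidate on these folds is not shown.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# 70% of the 630 samples (integer arithmetic avoids float rounding)
n_train = 630 * 70 // 100
folds = list(k_fold_indices(n_train, k=5))

for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(val_idx))
```

Each sample in the training portion serves as validation data exactly once, so a hyperparameter candidate can be scored by averaging its error over the five folds while the 30% test set stays untouched.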

  • Research Article
  • 10.3389/fonc.2026.1727595
Machine learning model for predicting malnutrition risk in lung cancer patients after thoracoscopic resection: a multi-center study.
  • Feb 9, 2026
  • Frontiers in Oncology
  • Tianfeng Chen + 6 more

Early detection of malnutrition is critical for timely intervention in lung cancer patients undergoing thoracoscopic resection. Existing black-box prediction models lack clinical interpretability, limiting trust and application. The present study was conducted to predict malnutrition risk by establishing an explainable machine learning (ML) model, to evaluate the model's performance across several sites, and to develop a web-based application to aid clinical decision-making. A retrospective analysis was conducted on 1,134 lung cancer patients who underwent thoracoscopic resection at Dongguan People's Hospital between October 2021 and October 2024, consisting of a training set (n = 795) and a testing set (n = 339). Meanwhile, an external validation cohort (n = 273) was prospectively enrolled at the Affiliated Hospital of Guangdong Medical University from March to June of 2025. Furthermore, univariate and multivariate analyses were employed to determine the individual risk variables for post-operative malnutrition. This study constructed eight ML models using Gradient Boosting Machine (GBM), Neural Network, Logistic Regression, Extreme Gradient Boosting (XGBoost), Random Forest, K-Nearest Neighbors (KNN), Adaptive Boosting (AdaBoost), and Support Vector Machine (SVM). The performance of the established models was assessed by decision curve analysis (DCA) and receiver operating characteristic (ROC) curves. Meanwhile, the SHapley Additive exPlanations (SHAP) method was used to quantify feature contributions and visualize model outputs, enhancing clinical interpretability. Consequently, a web-based risk calculator was created to assist in personalized forecasting. Among 1,407 total patients, post-operative malnutrition incidence was 11.3% (159/1,407). Multivariate analysis identified seven independent risk factors: albumin (ALB), Nutritional Risk Screening 2002 score, age, intraoperative blood loss, total drainage volume, Basic Activities of Daily Living (BADL) score, and serum potassium (K). The XGBoost model outperformed others, with AUC 0.845 (95% CI: 0.771-0.919) in the testing set and 0.886 (95% CI: 0.841-0.932) in external validation. SHAP analysis clarified the relative importance of risk factors, improving interpretability. The XGBoost-based explainable ML model effectively predicts malnutrition risk in lung cancer patients after thoracoscopic resection. Integrating high predictive performance with interpretability, it supports clinical risk stratification and personalized nutritional interventions to improve post-operative outcomes. A publicly available web-based calculator facilitates easy clinical application.

  • Research Article
  • Cited by 21
  • 10.3389/fneur.2023.1185447
Interpretable machine learning for predicting 28-day all-cause in-hospital mortality for hypertensive ischemic or hemorrhagic stroke patients in the ICU: a multi-center retrospective cohort study with internal and external cross-validation
  • Aug 8, 2023
  • Frontiers in Neurology
  • Jian Huang + 9 more

Background: Timely and accurate outcome prediction plays a critical role in guiding clinical decisions for hypertensive ischemic or hemorrhagic stroke patients admitted to the ICU. However, interpreting and translating the predictive models into clinical applications are as important as the prediction itself. This study aimed to develop an interpretable machine learning (IML) model that accurately predicts 28-day all-cause mortality in hypertensive ischemic or hemorrhagic stroke patients. Methods: A total of 4,274 hypertensive ischemic or hemorrhagic stroke patients admitted to the ICU in the USA from multicenter cohorts were included in this study to develop and validate the IML model. Five machine learning (ML) models were developed, including artificial neural network (ANN), gradient boosting machine (GBM), eXtreme Gradient Boosting (XGBoost), logistic regression (LR), and support vector machine (SVM), to predict mortality using the MIMIC-IV and eICU-CRD databases in the USA. Feature selection was performed using the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm. Model performance was evaluated based on the area under the curve (AUC), accuracy, positive predictive value (PPV), and negative predictive value (NPV). The ML model with the best predictive performance was selected for interpretability analysis. Finally, the SHapley Additive exPlanations (SHAP) method was employed to evaluate the risk of all-cause in-hospital mortality among hypertensive ischemic or hemorrhagic stroke patients admitted to the ICU. Results: The XGBoost model demonstrated the best predictive performance, with AUC values of 0.822, 0.739, and 0.700 in the training, test, and external cohorts, respectively. The analysis of feature importance revealed that age, ethnicity, white blood cell (WBC), hyperlipidemia, mean corpuscular volume (MCV), glucose, pulse oximeter oxygen saturation (SpO2), serum calcium, red blood cell distribution width (RDW), blood urea nitrogen (BUN), and bicarbonate were the 11 most important features. The SHAP plots were employed to interpret the XGBoost model. Conclusions: The XGBoost model accurately predicted 28-day all-cause in-hospital mortality among hypertensive ischemic or hemorrhagic stroke patients admitted to the ICU. The SHAP method can provide explicit explanations of personalized risk prediction, which can aid physicians in understanding the model.

  • Research Article
  • 10.1016/j.mlwa.2026.100880
Comparing allometric models to machine learning models for aboveground biomass estimation in agroforestry systems in Kenya
  • Jun 1, 2026
  • Machine Learning with Applications
  • Samuel Irungu Kigotho + 5 more


  • Research Article
  • 10.1111/joor.70108
An Interpretable Machine Learning Model Based on MRI Features for Predicting Pain Severity in Temporomandibular Disorders.
  • Nov 18, 2025
  • Journal of Oral Rehabilitation
  • Chuanfang Xu + 6 more

Chronic pain around the temporomandibular joint (TMJ) and masticatory muscles is a primary symptom of temporomandibular disorders (TMD). However, the clinical significance of magnetic resonance imaging (MRI) features in predicting TMD-related pain remains unclear. This study aimed to develop and interpret machine learning (ML) models based on MRI characteristics for predicting pain severity in patients with TMD. The present retrospective study included 584 patients with TMD between January 2022 and December 2024, yielding a total of 755 TMJ MRI data sets. Pain severity was classified using the visual analogue scale (VAS). Demographic variables (age, sex) and MRI features (lesion side, disc position, disc morphology, disc signal, disc perforation, bilaminar zone tear, joint space, joint effusion, condylar movement, bony changes and morphology/signal of the lateral pterygoid muscle) were collected. Eleven ML models based on demographic and MRI features were developed: logistic regression (LR), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), adaptive boosting (AdaBoost), gradient boosting classifier (GBC), bagging classifier (BC), extremely randomised trees (ETC), decision tree classifier (DTC) and multilayer perceptron (MLP). Model performance was evaluated using multiple metrics, including the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity and F1 score. Precision-recall (PR) curves and calibration curves were plotted to assess discrimination and model calibration. Decision curve analysis (DCA) was conducted to evaluate the clinical net benefit across a range of threshold probabilities. Model interpretability was enhanced using Shapley Additive Explanations (SHAP), which quantified the contribution of each feature to individual predictions. Feature selection was conducted based on mean SHAP values, and separate LightGBM models were constructed using the top 3, 5, and 9 most important features, as well as the full feature set, for performance comparison. The data set was randomly divided into a training set (n = 604) and a test set (n = 151). Among the 11 ML models, the LightGBM model demonstrated the best predictive performance, with an AUC of 0.899, and was therefore identified as the optimal model. SHAP analysis identified age, disc position and condylar movement as the top three contributing features. Feature selection analysis indicated that selecting the top nine SHAP-ranked variables led to the highest diagnostic performance, with an AUC of 0.829. This study developed an interpretable, high-performing MRI-based ML model incorporating SHAP analysis to integrate imaging and clinical features for objective pain assessment, which may help identify high-risk TMD patients and guide personalised treatment strategies.

  • Research Article
  • Cited by 5
  • 10.1016/j.mtcomm.2024.108173
Data-driven shear strength prediction of steel reinforced concrete composite shear wall
  • Jan 23, 2024
  • Materials Today Communications
  • Peng Huang + 2 more


  • Research Article
  • Cited by 105
  • 10.1016/j.conbuildmat.2022.129227
Explainable machine learning models for predicting the axial compression capacity of concrete filled steel tubular columns
  • Oct 2, 2022
  • Construction and Building Materials
  • Celal Cakiroglu + 4 more


  • Research Article
  • Cited by 5
  • 10.21926/aeer.2404020
Comparative Analysis of Machine Learning Models and Explainable Artificial Intelligence for Predicting Wastewater Treatment Plant Variables
  • Oct 17, 2024
  • Advances in Environmental and Engineering Research
  • Fuad Bin Nasir + 1 more

Increasing urban wastewater and rigorous discharge regulations pose significant challenges for wastewater treatment plants (WWTP) to meet regulatory compliance while minimizing operational costs. This study explores the application of several machine learning (ML) models, specifically Artificial Neural Networks (ANN), Gradient Boosting Machines (GBM), Random Forests (RF), eXtreme Gradient Boosting (XGBoost), and hybrid RF-GBM models, in predicting important WWTP variables such as Biochemical Oxygen Demand (BOD), Total Suspended Solids (TSS), Ammonia (NH₃), and Phosphorus (P). Several feature selection (FS) methods were employed to identify the most influential WWTP variables. To enhance ML models’ interpretability and to understand the impact of variables on prediction, two widely used explainable artificial intelligence (XAI) methods, Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), were investigated in the study. Results derived from FS and XAI methods were compared to explore their reliability. The ML model performance results revealed that ANN, GBM, XGBoost, and RF-GBM have great potential for variable prediction, with low error rates and strong coefficients of determination, such as an R² value of 1 on the training set and 0.98 on the test set. The study also revealed that XAI methods identify common influential variables in each model’s prediction. This is a novel attempt to get an overview of both LIME and SHAP explanations on ML models for WWTP variable prediction.

  • Preprint Article
  • 10.2196/preprints.80719
Machine Learning-Based Predictive Models for Identifying Fetal Growth Restriction in Patients With Early-Onset Preeclampsia: Retrospective Study (Preprint)
  • Jul 17, 2025
  • Ying Zhang + 7 more

Background: Fetal growth restriction (FGR) is a common and severe complication of early-onset preeclampsia (PE, ≤34 weeks), significantly increasing risks of perinatal mortality and morbidity. Current prediction methods lack both accuracy and clinical interpretability, which may delay interventions. Objective: This study aimed to develop and validate machine learning (ML) models to predict FGR in patients with early-onset PE using routinely available clinical parameters. Methods: We conducted a retrospective study of 711 patients with early-onset PE (n=238 with FGR, n=473 without FGR) from Fujian Maternity and Child Health Hospital (2014-2024). After rigorous variable selection using univariate analysis and LASSO regression, 8 ML algorithms, including Logistic Regression (LR), Naive Bayes (NB), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Support Vector Machine (SVM), Gradient Boosting Decision Tree (GBDT), Multilayer Perceptron (MLP) and Elastic Network (EN), were trained on 70% of the data and validated on 30% of the data. Model performance was evaluated using sensitivity, specificity, accuracy, precision, F1-Score, Area Under the Receiver Operating Characteristic (AUROC), Area Under the Precision-Recall Curve (AUPRC) and calibration curves. Meanwhile, multivariate logistic regression was used to evaluate the independent predictive value of each variable in the prediction model. The Shapley Additive Explanations (SHAP) method provided model interpretability. Results: The MLP model demonstrated superior performance with AUROCs of 0.872 (training, n=158 with FGR, 31.8%) and 0.874 (validation, n=214 with FGR, 37.4%) among 8 ML models. Key predictive variables included pre-pregnancy body mass index (BMI), fundal height (FH), anemia, hyperuricemia, urinary microprotein (MAU) and fetal ultrasound biometric ratios (head circumference to abdominal circumference ratio (HC/AC), umbilical artery systolic-to-diastolic ratio (UA S/D), umbilical artery blood flow pulsation index (UA PI)). Furthermore, HC/AC, BMI, UA S/D and hyperuricemia were found to be the most influential predictors in ML via SHAP. Consistent with the SHAP results, multivariate logistic regression analysis found that BMI (protective factor, OR=0.905, P=.003), HC/AC (risk factor, OR=2.372, P<.001), anemia (risk factor, OR=1.914, P=.006) and hyperuricemia (risk factor, OR=1.631, P=.028) were independent predictors of FGR in patients with early-onset PE. Conclusions: Our MLP-based model accurately predicts FGR in early-onset PE patients using clinically accessible parameters. The integration of ultrasound biometric ratios and maternal biomarkers provides a practical tool for early risk stratification, with SHAP enhancing clinical interpretability for real-world application.

  • Research Article
  • Cited by 39
  • 10.1016/j.ecoenv.2024.117210
Identifying cardiovascular disease risk in the U.S. population using environmental volatile organic compounds exposure: A machine learning predictive model based on the SHAP methodology
  • Oct 23, 2024
  • Ecotoxicology and Environmental Safety
  • Qingan Fu + 7 more


  • Research Article
  • Cited by 1
  • 10.1186/s40069-025-00856-3
Explainable Machine Learning Framework with Experimental Validation for Strength Prediction of Magnesium Phosphate Cement
  • Nov 25, 2025
  • International Journal of Concrete Structures and Materials
  • Anxiang Song + 4 more

Magnesium Phosphate Cement (MPC) is recognized as an effective rapid repair material, with compressive strength serving as a key mechanical property indicator for its mortar formulations. Nevertheless, due to MPC's complex composition and formulation, predicting its compressive strength remains a significant challenge. In this study, a comprehensive database was developed, incorporating four key input variables: the magnesium-to-phosphate (M/P) molar ratio, water-to-cement (W/C) mass ratio, sand-to-binder (S/B) weight ratio, and the borax-to-magnesia (B/M) weight ratio. This dataset was used to train and validate eight machine learning models, including the Lightweight Gradient Boosting (LGB) algorithm, Support Vector Machine (SVM), Decision Tree (DT), Extreme Gradient Boosting (XGB), Ridge Regression (RR), Random Forest (RF), Backpropagation Neural Network (BP), and Gradient Boosting (GB) models. The eight machine learning models were evaluated using performance metrics, including Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Correlation Coefficient, and Root Mean Square Error (RMSE), to identify the optimal model, which was then optimized via the Gray Wolf Optimizer (GWO). The most accurate prediction of MPC compressive strength was attained using the XGB model, with the GWO-optimized XGB model showing enhancement in MAPE, MAE, R², and RMSE by 21.8%, 60.6%, 43.9%, and 55.3% respectively, relative to the unoptimized XGB model. Employing Shapley Additive exPlanations (SHAP) values and Partial Dependence Plots (PDP), this study facilitates the identification of the most influential input variables and quantifies their effects on MPC compressive strength. The optimized model was validated against experimental data, demonstrating robust and conservative prediction behavior. While the model is trained solely to predict compressive strength, its interpretability enables rational insights into how formulation variables influence strength, thereby supporting informed mix design decisions. This framework offers a reliable and transparent computational tool for preemptive strength assessment of MPC and guides the optimization of mechanical performance in structurally demanding applications.

  • Research Article
  • Cited by 9
  • 10.3389/fpsyg.2024.1392240
Identifying the most crucial factors associated with depression based on interpretable machine learning: a case study from CHARLS.
  • Jul 25, 2024
  • Frontiers in Psychology
  • Rulin Li + 3 more

Depression is one of the most common mental illnesses among middle-aged and older adults in China. It is of great importance to find the crucial factors that lead to depression and to effectively control and reduce the risk of depression. Currently, there are limited methods available to accurately predict the risk of depression and identify the crucial factors that influence it. We collected data from 25,586 samples from the harmonized China Health and Retirement Longitudinal Study (CHARLS), and the latest records from 2018 were included in the current cross-sectional analysis. Ninety-three input variables in the survey were considered as potential influential features. Five machine learning (ML) models were utilized: CatBoost, eXtreme Gradient Boosting (XGBoost), Gradient Boosting Decision Tree (GBDT), Random Forest (RF), and Light Gradient Boosting Machine (LightGBM). The models were compared to the traditional multivariable Linear Regression (LR) model. Simultaneously, SHapley Additive exPlanations (SHAP) were used to identify key influencing factors at the global level and explain individual heterogeneity through instance-level analysis. To explore how different factors are non-linearly associated with the risk of depression, we employed the Accumulated Local Effects (ALE) approach to analyze the identified critical variables while controlling other covariates. CatBoost outperformed other machine learning models in terms of the MAE, MSE, MedAE, and R² metrics. The top three crucial factors identified by SHAP were r4satlife, r4slfmem, and r4shlta, representing life satisfaction, self-reported memory, and health status levels, respectively. This study demonstrates that the CatBoost model is an appropriate choice for predicting depression among middle-aged and older adults in Harmonized CHARLS. The SHAP and ALE interpretable methods have identified crucial factors and their nonlinear relationships with depression, which require the attention of domain experts.

  • Research Article
  • Cited by 20
  • 10.1155/2022/8089428
Predicting and Investigating the Permeability Coefficient of Soil with Aided Single Machine Learning Algorithm
  • Jan 1, 2022
  • Complexity
  • Van Quan Tran

The permeability coefficient of soils is an essential measure for designing geotechnical construction. The aim of this paper was to select the highest-performing, most reliable machine learning (ML) model for predicting the permeability coefficient of soil and to quantify feature importance for the predicted value using SHapley Additive exPlanations (SHAP) and one-dimensional Partial Dependence Plots (PDP 1D). To this end, five single ML algorithms, K-nearest neighbors (KNN), support vector machine (SVM), light gradient boosting machine (LightGBM), random forest (RF), and gradient boosting (GB), were used to build ML models for predicting the permeability coefficient of soils. Performance criteria for the ML models include the coefficient of determination (R²), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). The best-performing and most reliable single ML model on the testing dataset is the gradient boosting (GB) model, with R² = 0.971, RMSE = 0.199 × 10⁻¹¹ m/s, MAE = 0.161 × 10⁻¹¹ m/s, and MAPE = 0.185%. To identify and quantify feature importance, sensitivity studies using permutation importance, SHAP, and PDP 1D were performed with the GB model. The order of influence on the predicted permeability coefficient is: plasticity index, density > water content, liquid limit, plastic limit > clay content > void ratio. The plasticity index and density of soil are therefore the first-priority soil properties to measure when assessing the permeability coefficient of soil.
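Permutation importance, one of the sensitivity methods named above, can be illustrated with a small plain-Python sketch. It assumes the common formulation (mean increase in error when one feature column is shuffled) and uses mean squared error as the score. The toy predictor and data are hypothetical stand-ins for a fitted GB model and the soil dataset, which are not available here.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j replaced by shuffled values
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_predict(row):
    """Hypothetical stand-in for a fitted model: only feature 0 is used."""
    return 3.0 * row[0]

X = [[float(i), float(i % 3)] for i in range(20)]
y = [toy_predict(row) for row in X]
scores = permutation_importance(toy_predict, X, y)
print(scores)
```

A feature the model ignores scores zero because shuffling it leaves every prediction unchanged, while a feature the model relies on yields a positive score.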
