Predicting Corrosion Rates of Carbon Steel in Real-World Aquatic Environments Using Machine Learning
Corrosion in aquatic environments causes significant economic losses and structural degradation. This study models the corrosion rate of S235 carbon steel using machine learning (ML) under real-world aquatic conditions. A field dataset from 46 locations along the Ghent–Terneuzen canal was used, encompassing exposure and environmental parameters such as temperature, pH, total dissolved oxygen (HDO%), chlorophyll concentration, oxidation–reduction potential (ORP), total dissolved solids (TDS), chloride concentration, specific conductivity, depth, and salinity. Six ML algorithms, namely Light Gradient Boosting Machine (LightGBM), Gradient Boosting Regressor (GBR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), Neural Network (NN), and Categorical Boosting (CatBoost), were benchmarked before and after feature selection. This work demonstrates that environmental feature selection provides substantially greater predictive improvement than model architecture choice: feature selection enhanced all algorithms from poor (R² ≤ 0.14) to strong performance (R² = 0.70–0.80), reduced inter-model variation by 64%, and decreased prediction error by 48% (RMSE) and 74% (MSE). LightGBM achieved the best performance (MSE = 0.003, R² = 0.80). Unexpectedly, feature importance analysis showed that salinity and depth, traditionally considered critical factors, had minimal predictive influence, while exposure duration, pH, HDO%, temperature, chlorophyll concentration, and ORP dominated corrosion behaviour. These findings emphasize the critical role of environmental parameters and feature selection over model complexity, supporting more efficient corrosion monitoring and management in marine and aquatic environments.
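The error metrics quoted in this abstract are linked: MSE is the square of RMSE, so a uniform 48% cut in RMSE would correspond to roughly a 73% cut in MSE, which is close to the reported 74%. The sketch below illustrates the three metrics in plain Python. The function names and toy values are illustrative assumptions, not taken from the study.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error (square root of the MSE)."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mse_reduction_from_rmse_reduction(fraction):
    """Fractional MSE reduction implied by a fractional RMSE reduction."""
    return 1.0 - (1.0 - fraction) ** 2


# Toy corrosion rates (made-up numbers, for illustration only).
y_true = [0.10, 0.20, 0.30, 0.40]
y_pred = [0.12, 0.18, 0.33, 0.37]

implied = mse_reduction_from_rmse_reduction(0.48)  # about 0.73
```

The reported percentages are presumably aggregated across several models, so an exact match with this single-model identity is not expected.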
- Research Article
13
- 10.1016/j.fusengdes.2023.113964
- Aug 21, 2023
- Fusion Engineering and Design
Machine learning-based predictions of yield strength for neutron-irradiated ferritic/martensitic steels
- Research Article
14
- 10.3390/w14213509
- Nov 2, 2022
- Water
There is growing tension between high-performance machine-learning (ML) models and explainability within the scientific community. In arsenic modelling, understanding why ML models make certain predictions, for instance, “high arsenic” instead of “low arsenic”, is as important as the prediction accuracy. In response, this study aims to explain model predictions by assessing how the influencing input variables, i.e., pH, turbidity (Turb), total dissolved solids (TDS), and electrical conductivity (Cond), relate to arsenic mobility. The two main objectives of this study are to: (i) classify arsenic concentrations in multiple water sources using novel boosting algorithms such as natural gradient boosting (NGB), categorical boosting (CATB), and adaptive boosting (ADAB) and compare them with other existing representative boosting algorithms, and (ii) introduce a novel SHapley Additive exPlanation (SHAP) approach for interpreting the performance of ML models. The outcome of this study indicates that the newly introduced boosting algorithms performed efficiently, at a level comparable to the state-of-the-art boosting algorithms and a benchmark random forest model. Interestingly, extreme gradient boosting (XGB) proved superior to the remaining models in terms of overall and single-class performance metrics. Global and local interpretation (using SHAP with XGB) revealed that high pH water is highly correlated with high arsenic water and vice versa. In general, high pH, high Cond, and high TDS were found to be the potential indicators of high arsenic water sources. Conversely, low pH, low Cond, and low TDS were the main indicators of low arsenic water sources. This study provides new insights into the use of ML and explainable methods for arsenic modelling.
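SHAP attributions are based on Shapley values, which average each feature's marginal contribution over all orderings in which features can be added. The brute-force sketch below computes them exactly for a toy scoring function. The function `toy_score`, its coefficients, and the baseline and sample values are hypothetical and are not the study's XGB model or data. This approach is only practical for a handful of features; the SHAP library relies on much faster estimators.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Features that have not yet been 'added' keep their baseline value.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            updated = predict(current)
            totals[name] += updated - previous  # marginal contribution
            previous = updated
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in totals.items()}


def toy_score(f):
    """Hypothetical arsenic risk score with one interaction term."""
    return (0.5 * f["pH"] + 0.002 * f["TDS"] + 0.001 * f["Cond"]
            + 0.0005 * f["pH"] * f["TDS"])


baseline = {"pH": 7.0, "TDS": 300.0, "Cond": 500.0}   # assumed reference water
sample = {"pH": 8.5, "TDS": 900.0, "Cond": 1400.0}    # assumed sample
phi = shapley_values(toy_score, sample, baseline)
```

By construction the attributions sum to the difference between the sample's score and the baseline score, which is the additivity property that SHAP plots rely on.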
- Research Article
30
- 10.1155/er/8022398
- Jan 1, 2024
- International Journal of Energy Research
Sustainable energy management hinges on precise forecasting of renewable energy sources, with a specific focus on solar power. To enhance resource allocation and grid integration, this study introduces an innovative hybrid approach that integrates meteorological data into prediction models for photovoltaic (PV) power generation. A thorough analysis is performed utilizing the Desert Knowledge Australia Solar Centre (DKASC) Hanwha Solar dataset encompassing PV output power and meteorological variables from sensors. The aim is to develop a distinctive hybrid predictive model framework by integrating feature selection techniques with various regression algorithms. This model, referred to as the PV power generation predictive model (PVPGPM), utilizes meteorological data specific to the DKASC. In this study, various feature selection techniques are implemented, including Pearson correlation (PC), variance inflation factor (VIF), mutual information (MI), step forward selection (SFS), backward elimination (BE), recursive feature elimination (RFE), and embedded method (EM), to identify the most influential factors for PV power prediction. Furthermore, a hybrid predictive model integrating multiple regression algorithms is introduced, including linear regression, ridge regression, Least Absolute Shrinkage and Selection Operator (LASSO) regression, Elastic Net, Extra Trees Regressor, random forest regressor, gradient boosting (GB) regressor, eXtreme Gradient Boosting (XGBoost) Regressor, and a hybrid model thereof. Extensive experimentation and evaluation showcase the effectiveness of the proposed approach in achieving high prediction accuracy. Results demonstrate that the hybrid model comprising XGBoost Regressor, Extra Trees Regressor, and GB regressor surpasses other regression algorithms, yielding a minimal root mean square error (RMSE) of 0.108735 and the highest R-squared (R²) value of 0.996228. The findings underscore the importance of integrating meteorological insights into renewable energy forecasting for sustainable energy planning and management.
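Of the feature selection techniques listed in this abstract, the Pearson correlation filter is the simplest to illustrate: keep the features whose absolute correlation with the target exceeds a threshold. The sketch below is a minimal version under stated assumptions. The feature names, the readings, and the 0.5 threshold are invented for illustration and do not come from the DKASC dataset or the study's settings.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant columns carry no linear information
    return cov / (sd_x * sd_y)


def select_by_correlation(features, target, threshold=0.5):
    """Return names with |r| >= threshold, plus all correlation scores."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Made-up sensor readings (illustrative only).
features = {
    "irradiance": [100, 300, 500, 700, 900],
    "temperature": [18, 22, 27, 30, 33],
    "wind_speed": [3.0, 1.0, 4.0, 1.5, 3.5],
}
pv_power = [0.4, 1.3, 2.1, 3.0, 3.8]

selected, scores = select_by_correlation(features, pv_power)
```

A filter like this only captures linear, one-feature-at-a-time relationships, which is one reason studies often compare it against wrapper methods such as RFE.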
- Research Article
53
- 10.1016/j.fuel.2024.131346
- Mar 1, 2024
- Fuel
Enhancing biomass Pyrolysis: Predictive insights from process simulation integrated with interpretable Machine learning models
- Research Article
17
- 10.3390/app142210532
- Nov 15, 2024
- Applied Sciences
Major depressive disorder (MDD) poses a significant challenge in mental healthcare due to difficulties in accurate diagnosis and timely identification. This study explores the potential of machine learning models trained on EEG-based features for depression detection. Six models and six feature selection techniques were compared, highlighting the crucial role of feature selection in enhancing classifier performance. This study investigates the six feature selection methods: Elastic Net, Mutual Information (MI), Chi-Square, Forward Feature Selection with Stochastic Gradient Descent (FFS-SGD), Support Vector Machine-based Recursive Feature Elimination (SVM-RFE), and Minimal-Redundancy-Maximal-Relevance (mRMR). These methods were combined with six diverse classifiers: Logistic Regression, Support Vector Machine (SVM), Random Forest, Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Light Gradient Boosting Machine (LightGBM). The results demonstrate the substantial impact of feature selection on model performance. SVM-RFE with SVM achieved the highest accuracy (93.54%) and F1 score (95.29%), followed by Logistic Regression with an accuracy of 92.86% and F1 score of 94.84%. Elastic Net also delivered strong results, with SVM and Logistic Regression both achieving 90.47% accuracy. Other feature selection methods yielded lower performance, emphasizing the importance of selecting appropriate feature selection and machine learning algorithms. These findings suggest that careful selection and application of feature selection techniques can significantly enhance the accuracy of EEG-based depression detection.
- Research Article
- 10.1016/j.grets.2025.100323
- Apr 1, 2026
- Green Technologies and Sustainability
Predicting building occupancy is an important element of structural health monitoring (SHM), smart building systems, and energy management. In this study, several visualization tools are used to analyze model performance, allowing engineers and operations personnel to interpret and review forecasts and to make informed, data-driven decisions about the structural health of the building. The study also introduces a new machine learning (ML) system, coupled with optimization schemes, to improve the efficiency and accuracy of prediction. A set of state-of-the-art ML models, namely Light Gradient Boosting Machine (LightGBM), LightGBM Logistic Regression (LR), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), Gradient Boosting (GBM), and Random Forest (RF), is employed to build an occupancy prediction model. The model is applicable in structural health monitoring, smart building applications, and energy management. The proposed POA-LightGBM approach achieves high F1-scores in training (0.9373) and testing (0.9267). In addition, the model attains AUC–ROC values of 0.9995 and 0.9990 in the training and testing stages, respectively, indicating strong classification ability. These findings underline the efficiency, effectiveness, and dependability of the POA-LightGBM approach in addressing practical issues of structural health and failure prediction. By optimizing ML performance, the proposed approach offers an efficient and effective way to enhance the resilience of infrastructure and the energy efficiency of smart buildings. The proposed solution would promote energy management, smart building efficiency, and infrastructure resilience through streamlined predictive modeling.
• Occupancy prediction aids SHM, smart buildings, and energy management.
• Visualization tools help interpret forecasts for data-driven decisions.
• ML framework integrates LightGBM, CatBoost, XGBoost, RF, and POA optimizer.
• POA-LightGBM achieves 0.9933 test accuracy and high precision.
• Model ensures reliable occupancy prediction, boosting resilience and efficiency.
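The AUC–ROC figures quoted for this model can be read as the probability that a randomly chosen occupied sample receives a higher score than a randomly chosen unoccupied one. The sketch below computes AUC from that pairwise definition. The labels and scores are invented for illustration, and the quadratic loop is meant for clarity rather than for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up occupancy labels (1 = occupied) and classifier scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
auc = roc_auc(labels, scores)  # 8 of 9 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values such as 0.999 mean almost every occupied/unoccupied pair is ordered correctly.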
- Research Article
6
- 10.1177/20552076241280126
- Jan 1, 2024
- Digital health
Elderly patients are more likely to suffer from severe ischemic stroke (IS) and have worse outcomes, including death and disability. We aimed to develop and validate predictive models using novel machine learning algorithms for the 3-month mortality in elderly patients with IS admitted to the intensive care unit (ICU). We conducted a retrospective cohort study. Data were extracted from Medical Information Mart for Intensive Care (MIMIC)-IV and International Stroke Perfusion Imaging Registry (INSPIRE) database. Ten machine learning algorithms including Categorical Boosting (CatBoost), Random Forest (RF), Support Vector Machine (SVM), Neural Network (NN), Gradient Boosting Machine (GBM), K-Nearest Neighbors (KNNs), Multi-Layer Perceptron (MLP), Naive Bayes (NB), eXtreme Gradient Boosting (XGBoost) and Logistic Regression (LR) were used to build the models. Performance was measured using area under the curve (AUC) and accuracy. Finally, interpretable machine learning (IML) models presenting as Shapley additive explanation (SHAP) values were applied for mortality risk prediction. A total of 1826 elderly patients with IS admitted to the ICU were included in the analysis, of whom 624 (34.2%) died, and endovascular treatment was performed in 244 patients. After feature selection, a total of eight variables, including minimum Glasgow Coma Scale values, albumin, lactate dehydrogenase, age, alkaline phosphatase, body mass index, platelets, and types of surgery, were finally used for model construction. The AUCs of the CatBoost model were 0.737 in the testing set and 0.709 in the external validation set. The Brier scores in the training set and testing set were 0.12 and 0.21, respectively. The IML of the CatBoost model was performed based on the SHAP value and the Local Interpretable Model-Agnostic Explanations method. The CatBoost model had the best predictive performance for predicting mortality in elderly patients with IS admitted to the ICU. 
The IML model would further aid in clinical decision-making and timely healthcare services by the early identification of high-risk patients.
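The Brier scores reported in this abstract are mean squared differences between predicted probabilities and observed outcomes, where lower is better. A useful reference point is the score of a forecast that always predicts the event rate, which equals p(1 − p). The sketch below assumes the 34.2% mortality from the abstract also applies to the evaluated split, which the abstract does not state.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    pairs = list(zip(outcomes, probabilities))
    return sum((p - y) ** 2 for y, p in pairs) / len(pairs)


def constant_forecast_brier(prevalence):
    """Brier score of always predicting the event rate: p * (1 - p)."""
    return prevalence * (1.0 - prevalence)


# 34.2% mortality is taken from the abstract; applying it to a specific
# split is an assumption made here for illustration.
reference = constant_forecast_brier(0.342)  # about 0.225

# Made-up predictions for four patients (1 = died within 3 months).
toy_outcomes = [1, 0, 0, 1]
toy_probabilities = [0.8, 0.1, 0.3, 0.6]
toy_brier = brier_score(toy_outcomes, toy_probabilities)
```

Under that assumption, a score of 0.21 is only modestly below the uninformative reference, while 0.12 reflects noticeably sharper predictions.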
- Research Article
54
- 10.1109/access.2022.3181970
- Jan 1, 2022
- IEEE Access
Precision agriculture is a challenging task to achieve. Several studies have been conducted to forecast agricultural yields using machine learning algorithms (MLA), but few studies have used ensemble machine learning algorithms (EMLA). In the current study, we used a dataset generated by a computer simulation program, and meteorological data obtained over 30 years ago from Maine, United States (USA). The primary goal of this research is to increase the forecast accuracy of the best characteristics for overcoming hunger challenges. We designed stacking regression (SR) and cascading regression (CR) with a novel combination of MLA based on the wild blueberry dataset. We used features that indicated the best regulation for wild blueberry agroecosystems. Four feature selection techniques are applied: variance inflation factor (VIF), sequential forward feature selection (SFFS), sequential backward elimination feature selection (SBEFS), and extreme gradient boosting based on feature importance (XFI). We applied Bayesian optimization on popular MLA to obtain the best hyperparameters to achieve accurate wild blueberry yield prediction. The SR used a two-layer structure: level-0 contained light gradient boosting machine (LGBM), gradient boost regression (GBR), and extreme gradient boosting (XGBoost); level-1 provided the output prediction using a Ridge model. The CR topology uses the same MLA as SR, but in series: each stage takes the new prediction as an input and removes the previous prediction. We assessed the outcomes of many techniques, CR, and SR in terms of the root mean square error (RMSE) and coefficient of determination (R²). In the results, the proposed SR showed the best performance (0.984 R² and 179.898 RMSE) compared with another study that published 0.938 R² and 343.026 RMSE on the seven features selected by XFI. The SR achieved the highest R² of 0.985 on all features and on the features that were selected by SBEFS. Our SR outperformed CR, many other techniques, and another study on wild blueberry yield prediction.
- Research Article
4
- 10.1186/s41043-025-01095-8
- Oct 14, 2025
- Journal of Health, Population, and Nutrition
Background: Mental health challenges are a growing global public health concern, with university students at elevated risk due to academic and social pressures. Although several studies have examined mental health among Bangladeshi students, few have integrated conventional statistical analyses with advanced machine learning (ML) approaches. This study aimed to assess the prevalence and factors associated with depression, anxiety, and stress among Bangladeshi university students, and to evaluate the predictive performance of multiple ML models for those outcomes.
Methods: A cross-sectional survey was conducted in February 2024 among 1697 students residing in halls at two public universities in Bangladesh: Jahangirnagar University and Patuakhali Science and Technology University. Data on sociodemographic, health, and behavioral factors were collected via structured questionnaires. Mental health outcomes were measured using the validated Bangla version of the Depression, Anxiety, and Stress Scale-21 (DASS-21). Statistical analyses included chi-square tests and binary logistic regression, while seven ML models, including K-Nearest Neighbors (KNN), Random Forest (RF), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), Logistic Regression (LR), and Support Vector Machine (SVM), were employed to predict mental health outcomes.
Results: The prevalence of depression, anxiety, and stress was 56.9%, 69.5%, and 32.2%, respectively. Significant associated factors for depression included unfriendly family relationships, enrollment in commerce, and cigarette smoking. Female gender, unfriendly family relationships, academic year, and cigarette smoking were significant factors for stress. No significant factors were identified for anxiety. Among ML models, SVM achieved the highest accuracy for depression prediction (accuracy = 0.5693; precision = 0.7560; log loss = 0.6847), LR for anxiety (accuracy = 0.6948; precision = 0.7881), and CatBoost for stress (accuracy = 0.6706; precision = 0.6454; F1-score = 0.5777; log loss = 0.6284). Feature importance analyses highlighted faculty of study and relation with family as the top predictors. ROC-AUC values indicated moderate discriminatory performance (all ≥ 0.5).
Conclusions: Integrating machine learning with conventional analyses enhances the identification and prediction of factors associated with depression, anxiety, and stress among university students. These findings support the implementation of campus-based mental health screening, accessible counseling, and peer support programs, and highlight the value of data-driven approaches for developing targeted university mental health policies.
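The log loss values reported here are easier to judge against a reference: for a binary outcome, a model that always predicts a probability of 0.5 scores ln 2 ≈ 0.693. The sketch below implements binary log loss in plain Python and evaluates that reference. The labels are invented, and the clipping constant `eps` is an assumed convention rather than something the study specifies.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Binary cross-entropy averaged over samples.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# A coin-flip forecast on made-up labels gives ln(2), about 0.693.
chance_level = log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

Seen this way, the reported values of 0.6847 and 0.6284 sit only slightly below the coin-flip reference, which is consistent with the abstract's description of moderate discriminatory performance.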
- Research Article
66
- 10.1016/j.jobe.2022.104316
- Mar 11, 2022
- Journal of Building Engineering
Buckling and ultimate load prediction models for perforated steel beams using machine learning algorithms
- Research Article
15
- 10.3934/nhm.2023061
- Jan 1, 2023
- Networks and Heterogeneous Media
A denial-of-service (DoS) attack aims to exhaust the resources of the victim by sending attack packets and ultimately stop the legitimate packets by various techniques. The paper discusses the consequences of distributed denial-of-service (DDoS) attacks in various application areas of Internet of Things (IoT). In this paper, we have analyzed the performance of machine learning (ML)-based classifiers including bagging and boosting techniques for the binary classification of attack traffic. For the analysis, we have used the benchmark CICDDoS2019 dataset which deals with DDoS attacks based on User Datagram Protocol (UDP) and Transmission Control Protocol (TCP) in order to study new kinds of attacks. Since these protocols are widely used for communication in IoT networks, this data has been used for studying DDoS attacks in the IoT domain. Since the data is highly unbalanced, class balancing is done using an ensemble sampling approach comprising random under-sampler and ADAptive SYNthetic (ADASYN) oversampling technique. Feature selection is achieved using two methods, i.e., (a) Pearson correlation coefficient and (b) Extra Tree classifier. Further, performance is evaluated for ML classifiers viz. Random Forest (RF), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, eXtreme Gradient Boosting (XGBoost) and Gradient Boosting (GB) algorithms. It is found that RF has given the best performance with the least training and prediction time. Further, it is found that feature selection using extra trees classifier is more efficient as compared to the Pearson correlation coefficient method in terms of total time required in training and prediction for most classifiers. It is found that RF has given best performance with least time along with feature selection using Pearson correlation coefficient in attack detection.
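The class-balancing step mentioned in this abstract combines random under-sampling with ADASYN oversampling. The sketch below covers only the under-sampling half: it randomly trims every class to the size of the smallest one. ADASYN, which synthesizes new minority samples, is not implemented here. The function name, the seed, and the toy rows are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly keep as many rows per class as the rarest class has."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = min(len(members) for members in by_class.values())
    kept_rows, kept_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            kept_rows.append(row)
            kept_labels.append(label)
    return kept_rows, kept_labels


# Made-up traffic records: 10 attack flows and 2 benign flows.
rows = [[i] for i in range(12)]
labels = ["attack"] * 10 + ["benign"] * 2

balanced_rows, balanced_labels = random_undersample(rows, labels)
counts = Counter(balanced_labels)
```

Under-sampling alone discards majority-class data, which is presumably why the study pairs it with an oversampling technique.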
- Research Article
26
- 10.1109/access.2023.3346327
- Jan 1, 2024
- IEEE Access
Energy providers and the power grid are severely harmed by electricity theft, which also causes economic and non-technical losses. Energy theft causes a decline in power quality and overall profitability. Smart grids may address the problem of power theft by merging data and energy flow. The analysis of smart grid data helps to find power theft. The prior methods, however, could have done a better job of identifying energy theft. In this research, we presented an active learning-based machine learning model for energy theft identification and classification of a smart grid. The suggested approach is based on the following steps. We use a dataset from the Open Energy Data Initiative (OEDI), an energy research database that gets information from numerous OEDI offices and labs. Next, we pre-process the data and employ machine learning methods like Active Learning (AL) based Random Forests (RFAL), eXtreme Gradient Boosting (XGboostAL), Decision Tree (DTAL), Gradient Boosting (GBAL), K-Nearest Neighbors (KNNAL), Categorical Boosting (CatboostAL) and Light Gradient Boosting Machine (LGBMAL) classifier. Using the smart grid-based energy theft detection dataset, the proposed RFAL model outperforms other competing models and obtains an accuracy of 70.61%. The principles of smart grid tasks streamline decisions and enhance interaction between humans and machines by combining AL with machine learning. The application of this technology in this area has the potential to enhance the accuracy of energy theft detection and electricity-related problems and consequences.
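The abstract does not say which query strategy the active learning (AL) variants use. One common choice, assumed here purely for illustration, is uncertainty sampling: the learner asks for labels on the unlabeled samples whose predicted theft probability is closest to 0.5. The function name and the probabilities below are hypothetical.

```python
def most_uncertain(probabilities, k=1):
    """Indices of the k pool samples whose probability is nearest 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Made-up predicted theft probabilities for five unlabeled meters.
pool_probabilities = [0.95, 0.52, 0.10, 0.45, 0.80]
query = most_uncertain(pool_probabilities, k=2)  # meters 1 and 3
```

In a full loop, the queried samples would be labeled, added to the training set, and the classifier refit before the next query round.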
- Research Article
2
- 10.1002/hsr2.70323
- Dec 30, 2024
- Health Science Reports
Background and Objectives: Assessing treatment response in glioblastoma multiforme (GBM) tumors necessitates developing more objective and quantitative approaches. A machine learning-based approach is presented in this exploratory study for GBM patients' treatment response assessment based on radiomics extracted from magnetic resonance (MR) images.
Methods: MR images from 77 GBM patients were acquired at two post-surgery stages and preprocessed. From these images, 107 radiomics were extracted from the segmented tumoral cavities. The most informative features for training machine learning (ML) classifiers were identified using the Spearman correlation analysis of features retained by the forward sequential and LASSO algorithms. Applied machine learning models included support vector machine (SVM), random forest (RF), K-nearest neighbors (KNN), AdaBoost, categorical boosting (CatBoost), light gradient boosting machine (LightGBM), extreme gradient boosting (XGBoost), Naïve Bayes (NB) and logistic regression (LR). Ten-fold cross-validation was used to validate the models. Statistical analysis was conducted using SPSS version 27; p-value < 0.05 was considered significant.
Results: The Naïve Bayes classifier demonstrated the highest performance among the trained models, achieving an AUC (area under the receiver operating characteristic curve) of 0.86 ± 0.13 when trained on the seven features selected by the forward sequential algorithm and an AUC of 0.84 ± 0.14 when trained using the five features chosen by the LASSO algorithm. The second-best performance was observed with the KNN classifier, which achieved an AUC of 0.80 ± 0.17 when trained on the features selected by the forward sequential algorithm.
Conclusion: Findings demonstrated that MRI-based radiomics could be used as distinctive features to train ML models for GBM patients' treatment response assessment. Trained ML classifiers based on these features serve as aiding tools to expedite the quantitative assessment of GBM patients' treatment response besides qualitative evaluations.
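With only 77 patients, ten-fold cross-validation leaves seven or eight patients in each test fold, which helps explain the wide spread in the reported AUC values (for example ± 0.13). The sketch below shows how such folds can be formed. The shuffling seed and the absence of stratification are assumptions, since the abstract does not describe how the folds were built.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k (train, test) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(77, k=10)
fold_sizes = sorted(len(test) for _, test in splits)
```

Each patient appears in exactly one test fold, and the per-fold scores are then averaged to give figures of the form mean ± standard deviation.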
- Research Article
1
- 10.1038/s41598-025-22812-7
- Nov 18, 2025
- Scientific Reports
This study developed a Python-based framework to predict the ultimate bearing capacity of shallow foundations on cohesionless soil, employing machine learning (ML) and deep learning (DL) techniques. Utilizing a comprehensive dataset of 116 footing experiments, eleven ML models (Gaussian Process Regression (GPR), Extreme Gradient Boosting (XGBoost), Gradient Boosting Machine (GBM), Random Forest (RF), Categorical Boosting (CatBoost), etc.) and five DL models (Artificial Neural Network (ANN), Deep Neural Network (DNN), etc.) were trained and compared against traditional methods. Input parameters included foundation dimensions and soil properties. Results demonstrated that ML and DL models significantly outperformed traditional equations, achieving higher accuracy. Ensemble methods like GPR, XGBoost, GBM, RF, and CatBoost exhibited superior performance, with coefficient of determination (R²) values above 0.988 and mean absolute percentage error (MAPE) below 5.07%. Conversely, traditional methods showed lower accuracy, with R² values ranging from 0.684 to 0.82 and MAPE exceeding 19.63%. Taylor diagram analysis confirmed the improved performance of ML and DL. Additionally, a SHapley Additive exPlanations (SHAP) analysis highlighted foundation depth and soil friction angle as the most influential parameters, consistent with geotechnical principles.
- Research Article
12
- 10.1080/0951192x.2024.2372252
- Jul 7, 2024
- International Journal of Computer Integrated Manufacturing
This study investigates the use of machine learning models to predict surface roughness (Ra) in milling multi-grade aluminum alloys without prior knowledge of optimal cutting parameters. A diverse milling dataset encompassing material properties and cutting parameters from various aluminum alloy grades was compiled from research articles. Four machine learning algorithms, Extreme Gradient Boosting (XGB), Random Forest (RFR), Categorical Boosting (CAT), and Gradient Boosting Regression (GBR), were employed to develop the predictive model. The dataset underwent cleaning, imputation, and outlier removal to ensure data quality. Feature engineering incorporated material properties and cutting parameters for model training. Performance metrics such as RMSE, MAPE, and R² were used to assess the models’ accuracy. The SHapley Additive exPlanations (SHAP) technique was employed to interpret the models and identify influential features. GBR achieved the highest prediction accuracy with an RMSE of 0.2507 µm, MAPE of 23.36%, and R² of 0.8709. Thermal conductivity, feed rate, and cutting speed were consistently identified as the most influential factors, although their rankings differed slightly. This study successfully developed a GBR model for effective Ra prediction in aluminum alloy milling, supporting advancements in smart manufacturing by enabling accurate surface quality prediction and data-driven process optimization through machine learning.
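RMSE and MAPE answer different questions: RMSE is expressed in the units of Ra (µm), while MAPE expresses each error relative to the measured value, so small absolute errors on smooth surfaces can still produce a large percentage. The sketch below computes MAPE in plain Python. The roughness values are invented for illustration and are not from the compiled dataset.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes no true value is zero, since each error is divided by it.
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up surface roughness values in micrometres.
ra_measured = [0.4, 0.8, 1.6, 3.2]
ra_predicted = [0.5, 0.7, 1.6, 2.8]
error_percent = mape(ra_measured, ra_predicted)  # about 12.5
```

In this toy case the 0.1 µm miss on the smoothest surface contributes 25% on its own, twice as much as the 0.4 µm miss on the roughest one.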