Predicting Chemical Biodegradability for Sustainable Chemical Manufacturing: A Machine Learning Approach Using 3D Molecular Descriptors
Achieving sustainable cities and promoting responsible consumption require innovative approaches to chemical design and manufacturing. Precise prediction of chemical biodegradability is crucial for evaluating environmental concerns and facilitating the transition towards green chemistry. This study investigates the effectiveness of ten distinct groups of three-dimensional (3D) molecular descriptors for classifying compounds with rapid biodegradability. The Merck molecular force field (MMFF94s) was used to compute descriptors and generate 3D conformations for a dataset of chemical compounds. The dataset underwent rigorous preprocessing, including feature selection, outlier management, and scaling. Support Vector Machines (SVMs) were tested alongside three tree-based ensemble learning algorithms: Extreme Gradient Boosting (XGBoost), Gradient Boosting Machine (GBM), and Random Forest. Bayesian optimization was employed to optimize model hyperparameters and enhance cross-validated Area Under the Receiver Operating Characteristic Curve (AUC). The GETAWAY descriptors, 3D autocorrelation descriptors, and 3D-MoRSE descriptors consistently demonstrated superior performance compared to other descriptors across all machine learning models. An SVM model trained on 3D autocorrelation descriptors achieved the highest prediction accuracy (0.88), sensitivity (0.83), specificity (0.91), F1-score (0.82), Cohen’s Kappa statistic (0.74), and an AUC of 0.93 on an independent test set. Advanced analytical techniques, including Permutation Feature Importance (PFI), SHapley Additive exPlanations (SHAP), and partial dependence plots (PDP), were utilized to identify the most influential 3D autocorrelation descriptors. The findings of this study demonstrate that 3D molecular descriptors, particularly 3D autocorrelations, play a critical role in developing accurate and interpretable models for predicting chemical biodegradability.
These models contribute significantly to the advancement of green chemical design and the development of effective regulatory policies that support the objectives of SDG 11 (Sustainable Cities and Communities) and SDG 12 (Responsible Consumption and Production). By fostering sustainable chemical manufacturing practices, we can create healthier and more resilient urban environments while minimizing the environmental impact of human activities.
- Research Article
- 10.1007/s00704-025-05703-9
- Aug 26, 2025
- Theoretical and Applied Climatology
Climate signals, driven by complex interactions and nonlinear relationships, shape weather patterns and long-term trends, complicating the identification of dominant drivers due to collinearity. This study investigates the consistency and uncertainty of machine learning (ML) techniques for feature importance in climate science, comparing SHapley Additive exPlanations (SHAP), Partial Dependence Plots (PDPs), and gain-based feature importance from Extreme Gradient Boosting (XGBoost). SHAP’s integration with Feed Forward Neural Networks (FFNN) and XGBoost is evaluated to assess model-specific uncertainties. Using winter precipitation data from Ohio, USA, as a case study, the relative contributions of global warming (GW) and the Interdecadal Pacific Oscillation (IPO) to precipitation changes are quantified. Results show GW consistently ranks higher than IPO in at least 60% of stations across all methods, with SHAP and PDPs agreeing in 89% of stations. Global SHAP importance from FFNN and XGBoost aligns in 82% of stations, with GW contributing 15% more than IPO on average, though disagreements in 18% of stations highlight model-dependent uncertainties. Temporal analysis using SHAP values indicates a moderate discrepancy in feature importance between FFNN and XGBoost models (Pearson correlation ≈ 0.5), despite their consensus on the increasing dominance of GW in recent decades, contributing to wetter winters. Regression analysis further confirms that GW accounts for approximately 70% of the multi-decadal variability in winter precipitation across Ohio, with PDPs indicating a strong monotonicity (ρ = 0.94) between warming levels and precipitation increase. PDPs visualize marginal effects but struggle with interactions, while gain-based methods tend to favor features with a greater number of effective split points that reduce loss. SHAP, though robust for ranking, varies with the base model. 
An ensemble framework is proposed, demonstrating the value of combining these ML techniques complementarily to account for uncertainties and enhance interpretability. This study highlights the importance of addressing methodological uncertainties in feature importance rankings to provide robust insights for climate modeling.
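The remark that PDPs show marginal effects but struggle with interactions can be illustrated with a toy calculation. The following is a minimal sketch, not the study's method: the function `partial_dependence`, the two toy models, and the six-row dataset are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
def partial_dependence(model, rows, feature, grid):
    """For each grid value, force one feature to that value in every row
    and average the model output over the rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve


def additive_model(row):
    # Hypothetical model with no interaction: each feature acts on its own.
    return 2.0 * row[0] + row[1]


def interaction_model(row):
    # Hypothetical model that is a pure interaction between the two features.
    return row[0] * row[1]


# Toy data: x2 is symmetric around zero, so its average is zero.
rows = [(x1, x2) for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 1.0)]
grid = [-1.0, 0.0, 1.0]

pdp_additive = partial_dependence(additive_model, rows, 0, grid)
pdp_interaction = partial_dependence(interaction_model, rows, 0, grid)

print("additive model PDP for x1:   ", pdp_additive)
print("interaction model PDP for x1:", pdp_interaction)
```

In this toy setting the additive model yields a rising curve for the first feature, while the interaction model yields a flat curve even though the first feature changes every individual prediction. Averaging over the other feature cancels the effect, which is one reason to read PDPs alongside an attribution method rather than alone.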
- Research Article
- 10.1186/s42836-025-00360-9
- Jan 29, 2026
- Arthroplasty (London, England)
Total joint arthroplasty (TJA) complications necessitate the development of accurate risk prediction models; however, interpretability in machine learning remains a challenge. While Shapley Additive Explanations (SHAP) offers insights at the individual level, partial dependence plots (PDPs) may provide a better understanding at the population level for developing clinical guidelines. This study compared PDPs and SHAP in explaining machine learning-based 30-day complication risk prediction following TJA. We conducted a retrospective cohort study using the American College of Surgeons National Surgical Quality Improvement Program (NSQIP) database (2019–2023), including 517,826 primary TJA cases. Binary classification models (Random Forest, Gradient Boosting) predicted composite 30-day complications based on 20 clinical predictors. A comprehensive interpretability analysis employed directional concordance validation between PDP and SHAP, permutation importance thresholding (5% relative influence), followed by one- and two-dimensional partial dependence analyses with explicit interaction modeling. The cohort comprised 517,826 primary TJA procedures with a complication rate of 6.67%. The baseline Random Forest model achieved test AUC = 0.678. Directional concordance analysis demonstrated 97.8% weighted agreement between PDP trends and SHAP attributions, validating methodological comparison. Threshold analysis identified seven significant features, with interaction effects accounting for 49.9% of total model influence (71.9% among top features). PDPs showed actionable dose-response relationships, including critical thresholds for preoperative hematocrit (< 38%), operative time (> 120 min), and complementary interactions, such as age × ASA classification (19.1% importance), operative time × ASA classification (10.1%), and hematocrit × diabetes (6.4%).
Comparative patient analysis demonstrated that while SHAP quantified individual contributions, only PDPs provided population thresholds directly translatable to institutional protocols. PDPs appear more methodologically appropriate than SHAP for population-level clinical guideline development, offering actionable dose-response relationships and population risk thresholds that SHAP's individualized attribution framework cannot provide. The dominance of interaction effects among the most influential predictors validates that PDPs accurately capture complementary relationships while presenting them in a format directly applicable to evidence-based perioperative protocols and institutional quality improvement initiatives.
- Research Article
- 10.21037/tau-2025-350
- Oct 25, 2025
- Translational Andrology and Urology
Background: While environmental heavy metal exposure has been linked to various metabolic disorders, its association with overactive bladder (OAB) remains poorly characterized. Emerging evidence suggests body mass index (BMI) may mediate heavy metal-induced metabolic dysregulation, though underlying pathways remain unclear. This study investigates the interplay between heavy metal exposure, BMI, and OAB risk via explainable machine learning (ML) and mediation analysis. Methods: Drawing on data from the National Health and Nutrition Examination Survey (NHANES) [2005–2010], we identified OAB-associated heavy metals via least absolute shrinkage and selection operator (LASSO) regression and the Boruta algorithm, then developed ten ML models. The optimal model, Extreme Gradient Boosting (XGBoost), was selected based on performance metrics and interpreted via Permutation Feature Importance (PFI), Shapley Additive Explanations (SHAP), and Partial Dependence Plots (PDP). Dose-response relationships, mixture effects, and BMI-mediated pathways were validated through logistic regression (LR), restricted cubic splines (RCS), Bayesian kernel machine regression (BKMR), and mediation analysis. Results: Among 3,201 eligible participants, blood lead, blood iron, urinary barium, urinary cadmium, urinary thallium, and urinary mercury were identified as OAB-associated metals. The XGBoost model achieved superior predictive performance [area under the curve (AUC): 0.736]. PFI highlighted hypertension, urinary cadmium, and age as key OAB determinants, while SHAP emphasized urinary cadmium and blood iron as primary predictors. PDP revealed a positive cadmium-OAB association and an inverse iron-OAB relationship. LR confirmed blood iron [odds ratio (OR) = 0.72, 95% confidence interval (CI): 0.57–0.90] and urinary cadmium (OR = 1.23, 95% CI: 1.06–1.42) as independent risk factors. RCS demonstrated linear trends for cadmium/iron and nonlinear trends for lead.
BKMR analysis confirmed a positive overall mixture effect (conditional posterior inclusion probabilities = 0.9860), with urinary cadmium showing the strongest exposure-response relationship. Mediation analysis indicated BMI mediated 14.80% of iron’s protective effect and partially counteracted cadmium/lead risks (mediation proportions: −17.33%). Conclusions: Urinary cadmium, blood lead, and iron emerge as critical OAB risk modulators, with BMI serving as a partial mediator. Integrating explainable ML with conventional epidemiology elucidates environmental-metabolic interactions in OAB pathogenesis, underscoring the need for heavy metal screening and BMI management in high-risk populations.
- Research Article
- 10.1016/j.imed.2024.09.005
- Feb 1, 2025
- Intelligent Medicine
Blood pressure abnormality detection and Interpretation utilizing Explainable Artificial Intelligence
- Research Article
- 10.1016/j.cscm.2023.e02818
- Dec 22, 2023
- Case Studies in Construction Materials
Multi-output machine learning for predicting the mechanical properties of BFRC
- Research Article
- 10.1186/s13321-023-00737-5
- Jul 28, 2023
- Journal of Cheminformatics
Molecular descriptors characterize the biological, physical, and chemical properties of molecules and have long been used for understanding molecular interactions and facilitating materials design. Some of the most robust descriptors are derived from geometrical representations of molecules, called 3-dimensional (3D) descriptors. When calculated from molecular dynamics (MD) simulation trajectories, 3D descriptors can also capture the effects of operating conditions such as temperature or pressure. However, extracting 3D descriptors from MD trajectories is non-trivial, which hinders their wide use by researchers developing advanced quantitative-structure–property-relationship models using machine learning. Here, we describe a suite of open-source Python-based post-processing routines, called PyL3dMD, for calculating 3D descriptors from MD simulations. PyL3dMD is compatible with the popular simulation package LAMMPS and enables users to compute more than 2000 3D molecular descriptors from atomic trajectories generated by MD simulations. PyL3dMD is freely available via GitHub and can be easily installed and used as a highly flexible Python package on all major platforms (Windows, Linux, and macOS). A performance benchmark study used descriptors calculated by PyL3dMD to develop a neural network and the results showed that PyL3dMD is fast and efficient in calculating descriptors for large and complex molecular systems with long simulation durations. PyL3dMD facilitates the calculation of 3D molecular descriptors using MD simulations, making it a valuable tool for cheminformatics studies.
- Research Article
- 10.1038/s41598-025-11601-x
- Jul 20, 2025
- Scientific Reports
The construction sector is proactively working to minimize the environmental impact of cement manufacturing by adopting alternative cementitious substances and cutting carbon emissions tied to concrete. This study investigates the viability of using waste industrial materials as a replacement for cement in concrete mixes. The primary goal is to predict the compressive strength of waste-incorporated concrete by evaluating the effects of materials such as cement, fly ash (FA), silica fume (SF), ground granulated blast furnace slag (GGBFS), metakaolin (MK), water usage, aggregate levels, and superplasticizer dosages. A total of 441 data entries were sourced from various publications. Multiple machine learning techniques, such as light gradient boosting (LGB), extreme gradient boosting (XGB), and decision trees (DT), along with hybrid approaches like XGB-LGB and XGB-DT, were utilized to study how these variables influence compressive strength. The dataset was partitioned into training and testing sets, and statistical tools were employed to assess the correlation between input variables and strength. Model accuracy was gauged using metrics such as mean absolute percentage error (MAPE), root mean square error (RMSE), and the coefficient of determination (R²). Among the models, the XGB and DT approach delivered the highest precision, with an R² of 0.928 in the training stage. Among hybrid models, XGB-DT exhibited a balanced performance, with R² values of 0.907 and 0.785 for the training and testing phases, respectively. Additionally, SHAP (SHapley Additive exPlanations) and partial dependence plots (PDP) were employed to pinpoint the optimal ranges for each variable’s contribution to the improvement of compressive strength. SHAP and PDP analyses indicated that coarse aggregate, superplasticizer, water, and cement content have a high influence on the model’s output. Additionally, a GGBFS content of 150–200 kg/m³ was identified as a key factor for optimizing compressive strength.
The study concludes that the hybrid models, along with the single models, can effectively forecast the compressive strength of concrete incorporating industrial byproducts, assisting the construction industry in efficiently evaluating material properties and understanding the influence of various input factors.
- Research Article
- 10.1016/j.conbuildmat.2023.132885
- Aug 10, 2023
- Construction and Building Materials
Prediction model of long-term tensile strength of glass fiber reinforced polymer bars exposed to alkaline solution based on Bayesian optimized artificial neural network
- Research Article
- 10.1007/s12145-025-01755-7
- Feb 1, 2025
- Earth Science Informatics
Controlling seawater intrusion (SWI) into freshwater aquifers is crucial for preserving water quality in coastal groundwater management. This research evaluates the performance of three machine learning (ML) models: eXtreme Gradient Boosting (BO-XGB), Light Gradient Boosting Machine (BO-LGB), and Categorical Gradient Boosting (BO-CGB) in predicting the SWI wedge length. A database of 345 numerical simulations was compiled from previous research, and Bayesian Optimization (BO) with fivefold cross-validation was used to fine-tune the models. The variables included abstraction well distance (Xa), abstraction well depth (Ya), recharge well distance (Xr), recharge well depth (Yr), abstraction rate (Qa), and artificial recharge rate (Qr) as inputs, with SWI wedge length (L) as the predicted quantity. Results show that BO-CGB consistently achieved the best performance, with high R² values (0.996 in training and 0.969 in testing) and low RMSE values (0.439 m in training and 1.327 m in testing). SHapley Additive exPlanations (SHAP) analysis highlighted that Qa and Qr had the most significant impact on SWI wedge length predictions, followed by Xa and Ya. Partial Dependence Plot (PDP) analysis revealed a strong negative correlation between flow variables Qa and Qr and wedge length, while Xr displayed a more complex, non-linear pattern. BO-CGB emerged as the most reliable model for predicting SWI wedge length. To facilitate practical application, an interactive Graphical User Interface (GUI) was developed, enabling users to input variables and receive instant predictions, enhancing the practical usability of the ML models in managing SWI in coastal aquifers.
- Research Article
- 10.1016/j.engstruct.2023.116236
- May 18, 2023
- Engineering Structures
Data-driven models for predicting tensile load capacity and failure mode of grouted splice sleeve connection
- Research Article
- 10.1007/s00521-025-11345-9
- Jun 5, 2025
- Neural Computing and Applications
This study investigates hybrid machine learning models combined with wavelet transforms for predicting clean energy market dynamics from 01.04.2014 to 02.05.2024. Models such as support vector regression (SVR), artificial neural networks (ANNs), eXtreme Gradient Boosting (XGBoost), gradient boosting machine (GBM), long short-term memory (LSTM), and convolutional neural network (CNN) are compared to forecast the Nasdaq Clean Edge Green Energy Index (NasdaqClean). Discrete wavelet transform (DWT) and continuous wavelet transform (CWT) are used for feature extraction and visualizations, capturing both short-term fluctuations and long-term trends. Shapley additive explanations (SHAP) and permutation feature importance (PFI) assess feature contributions. Analysis across sub-periods, including the Paris Agreement, COVID-19, and the Russia–Ukraine conflict, reveals that different models perform optimally in different periods. Specifically, Wavelet-SVR emerges as the most accurate model for the entire dataset and for the pre-Paris Agreement and Paris Agreement periods, demonstrating strong predictive power by reducing noise and enhancing feature extraction. LSTM performs best during COVID-19, capturing long-term dependencies and volatile market dynamics. Meanwhile, CNN yields the most accurate predictions during the Russia–Ukraine conflict, effectively identifying spatial patterns in the dataset.
- Research Article
- 10.1186/s40069-025-00856-3
- Nov 25, 2025
- International Journal of Concrete Structures and Materials
Magnesium Phosphate Cement (MPC) is recognized as an effective rapid repair material, with compressive strength serving as a key mechanical property indicator for its mortar formulations. Nevertheless, due to MPC's complex composition and formulation, predicting its compressive strength remains a significant challenge. In this study, a comprehensive database was developed, incorporating four key input variables: the magnesium-to-phosphate (M/P) molar ratio, water-to-cement (W/C) mass ratio, sand-to-binder (S/B) weight ratio, and borax-to-magnesia (B/M) weight ratio. This dataset was used to train and validate eight machine learning models: the Lightweight Gradient Boosting (LGB) algorithm, Support Vector Machine (SVM), Decision Tree (DT), Extreme Gradient Boosting (XGB), Ridge Regression (RR), Random Forest (RF), Backpropagation Neural Network (BP), and Gradient Boosting (GB). The eight models were evaluated using performance metrics, including Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Correlation Coefficient, and Root Mean Square Error (RMSE), to identify the optimal model, which was then optimized via the Gray Wolf Optimizer (GWO). The most accurate prediction of MPC compressive strength was attained using the XGB model, with the GWO-optimized XGB model showing improvements in MAPE, MAE, R², and RMSE of 21.8%, 60.6%, 43.9%, and 55.3%, respectively, relative to the unoptimized XGB model. Using Shapley Additive exPlanations (SHAP) values and Partial Dependence Plots (PDP), this study identifies the most influential input variables and quantifies their effects on MPC compressive strength. The optimized model was validated against experimental data, demonstrating robust and conservative prediction behavior.
While the model is trained solely to predict compressive strength, its interpretability enables rational insights into how formulation variables influence strength, thereby supporting informed mix design decisions. This framework offers a reliable and transparent computational tool for preemptive strength assessment of MPC and guides the optimization of mechanical performance in structurally demanding applications.
- Research Article
- 10.1186/s40677-025-00341-9
- Oct 28, 2025
- Geoenvironmental Disasters
Background: The use of incinerated bottom ash (IBA) as a sustainable construction material offers potential environmental benefits but introduces complex interactions with cement chemistry. Magnesium phosphate cement (MPC), known for its rapid hardening and superior bonding, can be optimized through the controlled incorporation of IBA. However, limited studies have addressed how the chemical components of IBA affect the compressive strength of MPC, particularly using data-driven approaches. Methods: A database of 396 experimental samples was compiled from previous studies considering mix proportions, oxide compositions, and curing conditions. Four ensemble machine learning algorithms, namely Extreme Gradient Boosting (XGB), Light Gradient Boosting (LGB), Gradient Boosting Regressor (GBR), and Random Forest (RFR), were employed to predict compressive strength. Model robustness was validated through 5-fold cross-validation. Feature interpretation was achieved using SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDP) to quantify individual and interactive effects of chemical and physical parameters. Results: The XGB model achieved the highest predictive accuracy, with mean training and testing R² values greater than 0.90 and 0.80, respectively, and the lowest mean absolute percentage error of 16.71%. SHAP analysis identified curing age as the most dominant factor, followed by FA/C, W/C, and MgO/PO4 ratios. IBA content and specific oxides such as Fe2O3 and Al2O3 contributed positively to strength within optimal ranges. PDP confirmed nonlinear dependencies, indicating a 26% reduction in strength as W/C increased from 0.1 to 0.6, while extended curing up to 28 days improved performance substantially. Conclusion: The integration of SHAP and PDP provided a transparent interpretation of feature interactions in IBA-modified MPC. The developed XGB model demonstrated strong generalization and interpretability.
The combined modeling approach offers a reliable predictive framework for optimizing IBA incorporation in sustainable binder systems and advancing eco-efficient material design.
- Research Article
- 10.1002/for.3254
- Jan 14, 2025
- Journal of Forecasting
Existing studies mainly focus on short‐term economic forecasts, but research on long‐term projections, particularly for periods spanning 6–10 years, remains insufficient, despite its importance. This gap may arise from the limitations of traditional linear methods in prediction tasks and pattern recognition, whereas machine learning techniques may help overcome these challenges. To address this, we employ five widely used machine learning models, namely artificial neural networks (ANN), random forest regression (RF), gradient boosting regression (GBR), extreme gradient boosting (XGBoost), and support vector regression (SVR), using cross‐country data from 109 countries between 1961 and 2019. To ensure robustness, we employ two distinct sampling methods for model validation. Our findings reveal that the ANN model outperforms others, particularly in long‐term predictions (6–10 years), with an average out‐of‐sample prediction R‐squared of 0.89. Furthermore, analyses using permutation feature importance (PFI) and SHapley Additive exPlanations (SHAP) methods indicate that while current growth rates are critical for short‐term forecasts (1–3 years), two primary variables representing a country's foundational characteristics (real GDP per capita and “country‐feature,” the latter akin to a country dummy variable) are crucial for long‐term predictions (4–10 years). This outcome demonstrates the ANN model's capacity to capture each country's unique characteristics and, through its highly non‐linear nature, successfully execute complex, long‐range forecasts. These results unveil the remarkable potential of machine learning in the realm of long‐term economic forecasting.
- Research Article
- 10.1016/j.pnpbp.2025.111473
- Aug 30, 2025
- Progress in neuro-psychopharmacology & biological psychiatry
Dual role of Interleukin-18 in linking metabolic and psychiatric symptoms: Insights from machine learning in schizophrenia.