Abstract

Rapid advances in machine learning (ML) provide fast, accurate, and widely applicable methods for predicting the reactivity of organic pollutants toward free radicals. In this study, the rate constants (logk) for reactions of four halogen radicals were predicted using Morgan fingerprints (MF) and Mordred descriptors (MD) in combination with a series of ML models. The findings highlighted that accurate prediction across datasets depended on an effective combination of descriptors and algorithms. To alleviate the challenge of limited sample size, we introduced a data combination strategy that merged the different datasets, which improved prediction accuracy and mitigated overfitting. The Light Gradient Boosting Machine (LightGBM) with MF and the Random Forest (RF) with MD, both trained on the unified dataset, were selected as the optimal models. SHapley Additive exPlanations (SHAP) analysis showed that the MF-LightGBM model captured the influence of electron-withdrawing and electron-donating groups, while autocorrelation, walk count, and information content descriptors were the key features in the MD-RF model. pH also made an important contribution. Applicability domain analysis further supported that the developed models can make reliable predictions for query compounds across a broader range. Finally, a practical web application for logk calculation was built.
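
The data combination strategy can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: the per-radical datasets are stacked into a single table, and each row carries its descriptor vector, the pH, and a one-hot indicator of the radical. The labels R1 to R4, the `combine_datasets` helper, the feature layout, and all numeric values are hypothetical placeholders, since the abstract does not name the radicals or describe the encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder labels: the abstract does not name the four radicals.
RADICALS = ["R1", "R2", "R3", "R4"]

# One record = (descriptor vector, pH, logk). The layout is an assumption.
Record = Tuple[List[float], float, float]


def combine_datasets(
    per_radical: Dict[str, List[Record]],
) -> Tuple[List[List[float]], List[float]]:
    """Stack per-radical records into one feature matrix and one target list.

    Each output row is: descriptors + [pH] + one-hot radical indicator.
    """
    X: List[List[float]] = []
    y: List[float] = []
    for radical in RADICALS:
        one_hot = [1.0 if radical == r else 0.0 for r in RADICALS]
        for descriptors, ph, logk in per_radical.get(radical, []):
            X.append(list(descriptors) + [ph] + one_hot)
            y.append(logk)
    return X, y


# Toy, made-up values for illustration only (not data from the study).
toy = {
    "R1": [([1.0, 0.0, 1.0], 7.0, 9.1)],
    "R3": [([0.0, 1.0, 1.0], 3.0, 8.4), ([1.0, 1.0, 0.0], 7.0, 7.9)],
}
X, y = combine_datasets(toy)
```

A single model trained on `X` and `y` would then see examples from every radical at once, which is the sense in which a unified dataset can ease the limited sample size of any one radical.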
