Machine learning (ML) based prediction of molecular properties has become a fast engine for designing green solvents, catalysts, functional materials, drugs, and other chemical products. However, the accuracy and stability of ML models can be degraded by poor data quality, an issue that is rarely studied in chemical product discovery and design. Inspired by dynamic ensemble selection (DES), this work proposes an improved DES based on chemical space deconstruction for molecular property prediction. We develop a chemical space representation and deconstruction model based on a self-organizing map (SOM) neural network, which enables rapid application of the improved DES to molecular samples. The resulting dynamic model ensemble architecture (SOM-DES) serves as a model enhancement technique for building a more accurate and stable ensemble model, with the aim of improving predictive performance in chemical subspaces with poor-quality data. To realize the architecture, a supervised dimensionality reduction algorithm is improved to extract more information from molecular features for DES optimization. In addition, a resampling strategy that combines the geometric synthetic minority oversampling technique (G-SMOTE) with chemical space deconstruction is proposed as a data augmentation technique to mitigate the effect of imbalanced data during DES training. Prediction of the ideal-gas enthalpy of formation is used as a case study to demonstrate the advantage of the proposed SOM-DES. The results indicate that SOM-DES (R² = 0.9731, RMSE = 55.4639) outperforms the traditional static ensemble strategy (SES, R² = 0.9552, RMSE = 71.5045) in prediction accuracy over the global chemical space.
More importantly, for chemical subspaces that are difficult to predict due to low data quality, SOM-DES shows a significant reduction in prediction errors compared to SES.
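The general idea of selecting models per region of chemical space can be illustrated with a minimal sketch. It is not the SOM-DES implementation described in this work: plain k-means prototypes stand in for trained SOM nodes, the selection rule (lowest validation RMSE within each region) is an assumption, and the function names `fit_prototypes`, `assign_region`, `select_models`, and `predict_des` are hypothetical.

```python
import numpy as np

def assign_region(X, protos):
    """Index of the nearest prototype for each sample (its 'subspace')."""
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return d.argmin(axis=1)

def fit_prototypes(X, n_regions, n_iter=20, seed=0):
    """Plain k-means prototypes; an assumed stand-in for trained SOM nodes."""
    rng = np.random.default_rng(seed)
    protos = X[rng.choice(len(X), size=n_regions, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign_region(X, protos)
        for k in range(n_regions):
            if np.any(labels == k):
                protos[k] = X[labels == k].mean(axis=0)
    return protos

def select_models(models, X_val, y_val, protos):
    """For each region, pick the model with the lowest validation RMSE.

    Regions with no validation samples fall back to the globally best model.
    """
    regions = assign_region(X_val, protos)
    preds = np.stack([m(X_val) for m in models])  # shape (n_models, n_val)
    global_rmse = np.sqrt(((preds - y_val) ** 2).mean(axis=1))
    best = np.full(len(protos), global_rmse.argmin())
    for k in range(len(protos)):
        mask = regions == k
        if mask.any():
            rmse = np.sqrt(((preds[:, mask] - y_val[mask]) ** 2).mean(axis=1))
            best[k] = rmse.argmin()
    return best

def predict_des(models, best, X, protos):
    """Route each sample to the model selected for its region."""
    regions = assign_region(X, protos)
    preds = np.stack([m(X) for m in models])
    return preds[best[regions], np.arange(len(X))]
```

In this sketch a static ensemble would apply one fixed combination of models everywhere, whereas the dynamic variant may choose a different model in each region, which is where gains on hard-to-predict subspaces would be expected to appear.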