Abstract

SummaryAccurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of the size and the quality of data sets on the performances of the solubility prediction models are unraveled, and the concepts of actual and observed performances are introduced. In an effort to curtail the gap between actual and observed performances, a quality-oriented data selection method, which evaluates the quality of data and extracts the most accurate part of it through statistical validation, is designed. Applying this method on the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is achieved.

Highlights

  • The solubility of chemical compounds in water is of fundamental interest, besides being a key property in the design, synthesis, performance, and functioning of new chemical motifs for various applications, including but not limited to drugs, paints, coatings, and batteries

  • In an effort to curtail the gap between actual and observed performances, a quality-oriented data selection method, which evaluates the quality of data and extracts the most accurate part of it through statistical validation, is designed

  • To develop an accurate solubility prediction model, we focus on the effects of data size and data quality on the prediction performance of ML models

Read more

Summary

Introduction

The solubility of chemical compounds in water is of fundamental interest, besides being a key property in the design, synthesis, performance, and functioning of new chemical motifs for various applications, including but not limited to drugs, paints, coatings, and batteries. Data-driven modeling holds the promise of making solubility predictions in a tiny fraction of a second. A data-driven model development consists of three main steps: collecting and processing train and test data, extracting and selecting key molecular descriptors, and training and testing the model. There has been a burgeon of efforts that apply the above steps for the development of datadriven solubility prediction models. Data-driven solubility prediction models cater for achieving results quickly, they have not yet widely been adopted in the community due to accuracy issues (Jouyban 2009). The factors that affect the performances of prediction models can be basically grouped into four categories (Haghighatlari et al, 2020): the size of data, the quality of data, the relevance of chemical descriptors, and the capability of the algorithm (Figure 1A). The first two pertain to the data and the latter two pertain to the model

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call