Pushing the limits of solubility prediction via quality-oriented data selection.

Murat Cihan Sorkun, J.M. Vianney A. Koelman, Süleyman Er

doi:10.1016/j.isci.2020.101961

Open Access

https://doi.org/10.1016/j.isci.2020.101961

Abstract

Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of the size and the quality of data sets on the performances of the solubility prediction models are unraveled, and the concepts of actual and observed performances are introduced. In an effort to curtail the gap between actual and observed performances, a quality-oriented data selection method, which evaluates the quality of data and extracts the most accurate part of it through statistical validation, is designed. Applying this method on the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is achieved.

Highlights

The solubility of chemical compounds in water is of fundamental interest, besides being a key property in the design, synthesis, performance, and functioning of new chemical motifs for various applications, including but not limited to drugs, paints, coatings, and batteries
In an effort to curtail the gap between actual and observed performances, a quality-oriented data selection method, which evaluates the quality of data and extracts the most accurate part of it through statistical validation, is designed
To develop an accurate solubility prediction model, we focus on the effects of data size and data quality on the prediction performance of ML models

Introduction

The solubility of chemical compounds in water is of fundamental interest and is a key property in the design, synthesis, performance, and functioning of new chemical motifs for various applications, including but not limited to drugs, paints, coatings, and batteries. Data-driven modeling holds the promise of making solubility predictions in a tiny fraction of a second. Developing a data-driven model consists of three main steps: collecting and processing training and test data, extracting and selecting key molecular descriptors, and training and testing the model. There has been a surge of efforts applying these steps to develop data-driven solubility prediction models. Although such models deliver results quickly, they have not yet been widely adopted in the community because of accuracy issues (Jouyban, 2009). The factors that affect the performance of prediction models can be grouped into four categories (Haghighatlari et al., 2020): the size of the data, the quality of the data, the relevance of the chemical descriptors, and the capability of the algorithm (Figure 1A). The first two pertain to the data and the latter two pertain to the model.
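The three development steps named in the introduction can be illustrated with a small sketch. It is not the authors' pipeline: the records are synthetic, the descriptor names `size` and `polarity` are hypothetical, and a one-variable least-squares fit stands in for whatever learning algorithm a real study would use. It only shows how data preparation, descriptor selection, and training and testing fit together.

```python
import math
import random


def make_synthetic_records(n, seed):
    """Step 1 stand-in: synthetic records, NOT measured solubility data."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        size = rng.uniform(1.0, 10.0)      # hypothetical descriptor
        polarity = rng.uniform(0.0, 5.0)   # hypothetical descriptor, unrelated to the toy target
        target = -0.5 * size + rng.gauss(0.0, 0.1)  # toy "log-solubility" with label noise
        records.append({"size": size, "polarity": polarity, "target": target})
    return records


def split(records, test_fraction=0.25):
    """Step 1, continued: hold out the last part of the records as test data."""
    cut = int(len(records) * (1 - test_fraction))
    return records[:cut], records[cut:]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_descriptor(train, candidates):
    """Step 2: keep the candidate most correlated with the target on the training set."""
    ys = [r["target"] for r in train]
    return max(candidates, key=lambda name: abs(pearson([r[name] for r in train], ys)))


def fit_line(train, name):
    """Step 3a: ordinary least squares for target = slope * descriptor + intercept."""
    xs = [r[name] for r in train]
    ys = [r["target"] for r in train]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def rmse(records, name, slope, intercept):
    """Step 3b: root-mean-square error of the fitted line on the given records."""
    squared = [(slope * r[name] + intercept - r["target"]) ** 2 for r in records]
    return math.sqrt(sum(squared) / len(squared))


records = make_synthetic_records(n=80, seed=0)
train, test = split(records)
chosen = select_descriptor(train, ["size", "polarity"])
slope, intercept = fit_line(train, chosen)
test_rmse = rmse(test, chosen, slope, intercept)
print(chosen, round(slope, 2), round(test_rmse, 3))
```

Because the test labels in this toy carry the same noise as the training labels, the reported error is measured against noisy values, which is one way data quality can enter the evaluation as well as the training.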

Journal: iScience
Publication Date: Dec 17, 2020
Citations: 33
License type: CC BY
