Abstract

Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational interpretation of the dissolution process as a numerical problem led to a small set of selected descriptors and to predictions that are independent of the applied machine learning method. These models gave significantly more accurate predictions than benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in the training data (LogS ± 0.7). Finally, they reproduced the physicochemical relationships between solubility and molecular properties in different solvents, which led to rational approaches for improving the accuracy of each model.

Highlights

  • Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation

  • More recent developments have focused on quantitative structure-activity/property relationships (QSAR/QSPR)[24,25], through statistical analysis and machine learning techniques[26,27,28]

  • We report our new approach, based on machine learning, to general solubility prediction in water and in organic solvents, the latter having been understudied

Summary

Results and discussion

Two new metrics were created for our evaluation: the percentage of predictions within LogS ± 0.7 and within LogS ± 1.0 of experimental values (%LogS ± 0.7 and %LogS ± 1.0). The former reflects the maximum accuracy of the model based on the available data and the latter the limits of the …

Solvent–solvent interactions: constant for each solvent.

The only exceptions are SVM, which gave notably better %LogS ± 0.7 with Water_set_wide and Acetone_set, and GP with Water_set_narrow. These results suggested that the overall accuracy of these predictions is less dependent on the machine learning model and is more dependent on the descriptors and data quality.
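The two metrics described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `percent_within`, the inclusive treatment of the boundary (an error of exactly 0.7 counts as a hit) and the sample values are all assumptions made for illustration only.

```python
def percent_within(y_true, y_pred, tol):
    """Return the percentage of predictions whose absolute error
    (in LogS units) is at most `tol`.

    Assumption: the boundary is inclusive; the source does not say
    whether an error of exactly `tol` counts as a hit.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one value is required")
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return 100.0 * hits / len(y_true)


# Hypothetical LogS values, invented for illustration (not data from the paper).
experimental = [-2.0, -3.5, -1.0, -4.0, 0.5]
predicted = [-2.5, -2.5, -1.25, -5.5, 0.5]

pct_07 = percent_within(experimental, predicted, 0.7)  # %LogS ± 0.7
pct_10 = percent_within(experimental, predicted, 1.0)  # %LogS ± 1.0

print(f"%LogS ± 0.7: {pct_07:.1f}")
print(f"%LogS ± 1.0: {pct_10:.1f}")
```

Computed this way, %LogS ± 1.0 can never be lower than %LogS ± 0.7 for the same set of predictions, so the pair gives a strict and a relaxed reading of the same error distribution.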
