Abstract
Background: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the availability of high-quality MP data as well as accurate chemical structure representations in order to develop models. Currently available datasets for MP prediction have been limited to around 50k molecules, while far more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in the appropriate form, could potentially be used to develop predictive models.
Results: We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (http://ochem.eu). A number of technical challenges were simultaneously solved to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task using 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to those based on the highly curated data used in previous publications. Data for chemicals that decomposed rather than melted were separated from data for compounds that underwent a normal melting transition, and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Last but not least, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed.
Conclusions: We have shown that automated tools for the analysis of chemical information have reached a mature stage, allowing for the extraction and collection of high-quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at http://ochem.eu/article/99826.
Electronic supplementary material: The online version of this article (doi:10.1186/s13321-016-0113-y) contains supplementary material, which is available to authorized users.
Highlights
Melting point (MP) is an important property with regard to the solubility of chemical compounds
Drug-like subsets: In our previous study, we showed that compounds with MPs in the range 50–250 °C made up the majority of compounds in drug-like collections [11]
Training a model with hundreds of thousands of descriptors is infeasible with algorithms that operate on the full matrix
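To make the scale of the full-matrix problem concrete, the sketch below multiplies the rounded figures from the abstract (about 300,000 data points and about 700,000 descriptors) and compares dense storage with a simple sparse layout. It is only an illustration and not the OCHEM implementation: the 4-byte value and index sizes, the 1-in-1000 non-zero density, and all function names are assumptions chosen for the example.

```python
# Rounded figures from the abstract; everything else here is an assumption.
N_MOLECULES = 300_000
N_DESCRIPTORS = 700_000
BYTES_PER_VALUE = 4  # assumed float32 storage


def dense_bytes(n_rows, n_cols, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed to store every cell of the matrix, zeros included."""
    return n_rows * n_cols * bytes_per_value


def sparse_bytes(n_nonzero, bytes_per_value=BYTES_PER_VALUE, bytes_per_index=4):
    """Rough cost of storing one (column index, value) pair per non-zero cell."""
    return n_nonzero * (bytes_per_value + bytes_per_index)


def to_sparse_rows(dense_rows):
    """Convert dense rows into a list of {column: value} dicts, dropping zeros."""
    return [{j: v for j, v in enumerate(row) if v != 0} for row in dense_rows]


def sparse_dot(sparse_row, weights):
    """Dot product that touches only the non-zero descriptors of one molecule."""
    return sum(v * weights[j] for j, v in sparse_row.items())


total_cells = N_MOLECULES * N_DESCRIPTORS
# Hypothetical density: 1 non-zero cell per 1000 (integer math avoids rounding).
nonzero_cells = total_cells // 1000

if __name__ == "__main__":
    print("cells in full matrix:", total_cells)
    print("dense storage (GB):", dense_bytes(N_MOLECULES, N_DESCRIPTORS) / 1e9)
    print("sparse storage (GB):", sparse_bytes(nonzero_cells) / 1e9)
```

Under these assumptions the full matrix has 2.1 × 10^11 cells, consistent with the ">200,000,000,000 entries" quoted in the abstract, and would need roughly 840 GB as dense 32-bit values. Storing only the non-zero cells reduces this by orders of magnitude, and operations such as `sparse_dot` scale with the number of non-zero descriptors per molecule, not with the full descriptor count.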
Summary
Melting point (MP) is an important property with regard to the solubility of chemical compounds. The prediction of physicochemical properties is important in the pharmaceutical industry for structure design and for optimizing ADME properties. Physicochemical parameters such as logP, pKa, logD, aqueous solubility and many others affect both drug-related properties and environmental chemicals such as surfactants, wetting agents and so on [1, 2]. The modeling of these properties is best facilitated by obtaining large, structurally diverse, high-quality datasets. Validating the measured property in any meaningful way is difficult, but manual inspection can highlight obvious errors in the parameters as captured (vide infra).