Abstract

Accurate prediction of lipophilicity—logP—based on molecular structures is a well-established field. Predictions of logP are often used to drive forward drug discovery projects. Driven by the SAMPL7 challenge, in this manuscript we describe the steps that were taken to construct a novel machine learning model that can predict and generalize well. This model is based on the recently described Directed-Message Passing Neural Networks (D-MPNNs). Further enhancements included: both the inclusion of additional datasets from ChEMBL (RMSE improvement of 0.03), and the addition of helper tasks (RMSE improvement of 0.04). To the best of our knowledge, the concept of adding predictions from other models (Simulations Plus logP and logD@pH7.4, respectively) as helper tasks is novel and could be applied in a broader context. The final model that we constructed and used to participate in the challenge ranked 2/17 ranked submissions with an RMSE of 0.66, and an MAE of 0.48 (submission: Chemprop). On other datasets the model also works well, especially retrospectively applied to the SAMPL6 challenge where it would have ranked number one out of all submissions (RMSE of 0.35). Despite the fact that our model works well, we conclude with suggestions that are expected to improve the model even further.

Highlights

  • IntroductionLipophilicity plays a key role in drug discovery, from an indirect role in ADME (absorption, distribution, metabolism, and excretion), to toxicity and potency [1, 2]

  • Lipophilicity plays a key role in drug discovery, from an indirect role in ADME, to toxicity and potency [1, 2]

  • In our first set of experiments (Table 1: Models 1–3) we studied the impact of varying the settings and adding extra descriptors natively available in chemprop

Read more

Summary

Introduction

Lipophilicity plays a key role in drug discovery, from an indirect role in ADME (absorption, distribution, metabolism, and excretion), to toxicity and potency [1, 2]. Machine learning (ML) approaches that predict logP come in two flavors: additive and property-based. Additive models such as XlogP3 [9] assume additivity of logP for different atom (types), while property-based models such as Simulations Plus (hereafter referred to as S+) logP [10] use statistical methods and molecular descriptors. A third flavor of methods, exemplified by COSMOtherm [11] is physics-based rather than the result of an ML approach. It is widely used the calculation speed is usually about an order of magnitude lower than for ML models

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call