Abstract

Binding prediction between targets and drug-like compounds with deep neural networks has produced promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models remains an open issue. In this work, we explored how different cross-validation strategies, applied to data from different molecular databases, affect the performance of proteochemometrics models for binding prediction. These strategies are (1) random splitting, (2) splitting based on K-means clustering (of both actives and inactives), (3) splitting based on source database, and (4) splitting based on both clustering and source database. These schemes are applied to a deep learning proteochemometrics model and to a simple logistic regression model used as a baseline. Additionally, two ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our deep learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases, and that a restrictive cross-validation scheme based on compound clustering yields lower but more robust and credible performance estimates. Our results also show better performance when molecules are represented by their fingerprints.
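
The difference between random splitting and the group-based strategies (2) and (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_folds`, the greedy balancing rule, and the toy labels are assumptions made for illustration, and the cluster ids are assumed to have been computed beforehand (for example by K-means on compound descriptors). The only property it demonstrates is that all samples sharing a group label end up in the same fold.

```python
from collections import defaultdict


def group_folds(groups, n_folds):
    """Assign each sample to a fold so that samples sharing a group label
    (e.g. a cluster id or a source database name) never span two folds.

    Groups are placed largest first, each into the currently smallest
    fold, which keeps fold sizes roughly balanced.
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)

    fold_sizes = [0] * n_folds
    fold_of = [None] * len(groups)
    # Sort by descending group size, then by label, so the result is deterministic.
    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for label, idxs in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for i in idxs:
            fold_of[i] = target
        fold_sizes[target] += len(idxs)
    return fold_of


# Hypothetical toy data: precomputed cluster ids and source databases
# for eight compounds.
clusters = [0, 0, 1, 1, 1, 2, 3, 3]
sources = ["ChEMBL", "ChEMBL", "ChEMBL", "MUV", "MUV", "MUV", "ChEMBL", "MUV"]

by_cluster = group_folds(clusters, n_folds=2)  # analogue of strategy (2)
by_source = group_folds(sources, n_folds=2)    # analogue of strategy (3)
```

With a random split, structurally similar compounds from the same cluster can appear on both sides of the split, which tends to inflate validation scores; grouping by cluster or by source database removes that overlap by construction.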

Highlights

  • In the intermediate case, overlapping targets are found only between training and validation, since both of these splits are built from the ChEMBL dataset, while the test set corresponds to the Maximally Unbiased Validation (MUV) database

Summary

Introduction

Proteochemometrics, or quantitative multi-structure-property-relationship modeling (QMSPR), is an extension of traditional quantitative structure-activity relationship (QSAR) modeling.[1] In QSAR, the target protein is fixed and its interaction with ligands (small molecules or compounds) is predicted from ligand descriptors alone. The aim of proteochemometrics is to predict the binding affinity value by modeling the interaction of both proteins and ligands.[1] To this end, a data matrix is built in which each row contains descriptors of both target and ligand, linked to an experimentally measured biological activity. The main advantages over QSAR are twofold: first, the induced model can be applied to predict interactions with new proteins as well as new ligands; second, it can take into account the underlying biological information carried by the protein, as well as other possible cross-interactions of the ligand.
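
The data matrix described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_pcm_matrix`, the identifiers, and the tiny descriptor vectors are all hypothetical, and real descriptors (for instance sequence-derived protein features and molecular fingerprints) would be far longer.

```python
def build_pcm_matrix(protein_desc, ligand_desc, activities):
    """Build a proteochemometrics data matrix.

    Each row concatenates the descriptor vector of a protein with that of
    a ligand; the matching label is the measured activity for that pair.
    """
    rows, labels = [], []
    for (protein_id, ligand_id), activity in activities.items():
        rows.append(protein_desc[protein_id] + ligand_desc[ligand_id])
        labels.append(activity)
    return rows, labels


# Hypothetical toy descriptors (not real data).
protein_desc = {"P1": [0.2, 0.7], "P2": [0.9, 0.1]}
ligand_desc = {"L1": [1, 0, 1], "L2": [0, 1, 1]}

# Hypothetical activity labels per (protein, ligand) pair: 1 = active, 0 = inactive.
activities = {("P1", "L1"): 1, ("P1", "L2"): 0, ("P2", "L1"): 0}

X, y = build_pcm_matrix(protein_desc, ligand_desc, activities)
```

In a QSAR setting the protein part of each row would be absent, since the target is fixed; including it is what lets a single model cover several targets at once.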
