Abstract

The performance of quantitative structure–activity relationship (QSAR) models depends largely on the relevance of the molecular representation used as the input data matrix. This work presents a thorough comparative analysis of two main categories of molecular representations (vector space and metric space) for fitting robust machine learning models in QSAR problems. For this assessment, seven molecular representations, comprising RDKit descriptors, five fingerprint types (MACCS, PubChem, FP2-based, Atom Pair, and ECFP4), and a graph matching approach (non-contiguous atom matching structure similarity; NAMS), in both vector space and metric space, were subjected to state-of-the-art machine learning methods that included different dimensionality reduction methods (feature selection and linear dimensionality reduction). Five distinct QSAR data sets were used for direct assessment and analysis. Results show that, in general, metric-space and vector-space representations are able to produce equivalent models, but there are significant differences between individual approaches. The NAMS-based similarity approach consistently outperformed most fingerprint representations in model quality, closely followed by Atom Pair fingerprints. To further verify these findings, the metric space-based models were fitted to the same data sets with the closest neighbors removed, and these results further strengthened the above conclusions. The metric space graph-based approach appeared significantly superior to the other representations, albeit at a significant computational cost.

Highlights

  • In the past 50 years, quantitative structure–activity relationship (QSAR) has become a powerful tool for drug design and discovery

  • Molecular representations for QSAR can be divided into two broad categories of methods, namely, vector space and metric space representations [6]

  • A vector space or linear space representation occurs when the set of modeling instances is represented as a vector, with its characteristics measured relative to some reference frame and having a notion of magnitude and direction from the origin


Summary

Introduction

In the past 50 years, quantitative structure–activity relationship (QSAR) modeling has become a powerful tool for drug design and discovery. The underlying principle in QSAR modeling is the assumption that molecular structure information is sufficient to model and predict biological or pharmacological activity. In QSAR studies, different molecular representations have been used to describe the information encoded in molecular structures so as to predict the quantitative relationships between biological activity (the response variable) and structural information (the predictors) [1,2,3,4,5]. The performance of QSAR models for the accurate characterization of biological molecular properties largely depends on the relevance of the selected molecular representation. Such representations can be divided into two broad categories of methods, namely, vector space and metric space representations [6].
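
To make the two categories concrete, the following minimal sketch builds both views from the same toy data. It is an illustration under stated assumptions, not the procedure used in this work: the three fingerprints are hypothetical sets of "on" bits, the helper names are invented for this example, and the Tanimoto (Jaccard) distance is assumed as the dissimilarity measure. In the vector-space view each molecule is a row of features; in the metric-space view each molecule is described only by its distances to the other molecules.

```python
from typing import List, Set

# Hypothetical toy fingerprints: each molecule is the set of "on" bit positions.
# Real fingerprints (e.g. ECFP4) would come from a cheminformatics toolkit.
FINGERPRINTS: List[Set[int]] = [
    {0, 2, 5},
    {0, 2, 6},
    {1, 3, 4, 7},
]
N_BITS = 8  # assumed fingerprint length for this toy example


def to_vector_space(fps: List[Set[int]], n_bits: int) -> List[List[int]]:
    """Vector-space view: one row per molecule, one column per fingerprint bit."""
    return [[1 if bit in fp else 0 for bit in range(n_bits)] for fp in fps]


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) distance: 1 - |a & b| / |a | b| on sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def to_metric_space(fps: List[Set[int]]) -> List[List[float]]:
    """Metric-space view: molecules described only by pairwise distances."""
    return [[tanimoto_distance(a, b) for b in fps] for a in fps]


X = to_vector_space(FINGERPRINTS, N_BITS)  # shape: molecules x bits
D = to_metric_space(FINGERPRINTS)          # shape: molecules x molecules
```

A model fitted on `X` sees explicit features and can use feature selection, whereas a model fitted on `D` sees only how far each molecule lies from the others, so the size of its input grows with the number of molecules rather than with the number of descriptors.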

