Abstract
The reliability of a QSAR classification model depends on its capacity to make confident predictions for new compounds that were not considered when the model was built. The results of this external validation process reveal the applicability domain (AD) of the QSAR model and, therefore, the robustness of the model when predicting the property/activity of new molecules. In this paper we propose the use of the rivality and modelability indexes to study whether a dataset can be correctly modeled by a QSAR algorithm and to predict the reliability of the built model when predicting the property/activity of new molecules. These indexes have a very low computational cost and do not require a model to be built, which makes them good tools for analyzing datasets in the first stages of building QSAR classification models. In our study, we selected two benchmark datasets with a similar number of molecules but very different modelability, and we corroborated the predictive capacity of the rivality and modelability indexes against classification models built using Support Vector Machine and Random Forest algorithms with 5-fold cross-validation and leave-one-out techniques. The results show the excellent ability of both indexes to predict outliers and the applicability domain of the QSAR classification models. In all cases, the index values accurately predicted the statistical parameters of the QSAR models generated by the algorithms.
Highlights
Research on the applicability domain (AD) of Quantitative Structure-Activity Relationship (QSAR) models has caught the attention of the chemometrics community in recent years [1,2,3,4,5,6,7,8].
Since the publication of the OECD report [9] describing the principles for the validation of QSAR models, several proposals have been published with the aim of determining the AD of QSAR models.
In this paper we describe the approach for classification models; the method can be extended to regression problems.
Summary
Research on the applicability domain (AD) of Quantitative Structure-Activity Relationship (QSAR) models has caught the attention of the chemometrics community in recent years [1,2,3,4,5,6,7,8]. According to the OECD guiding principles, a QSAR model should have: (i) a defined endpoint, (ii) an unambiguous algorithm, (iii) a defined domain of applicability, (iv) appropriate measures of goodness-of-fit, robustness, and predictivity, and (v) a mechanistic interpretation where possible [9,10]. Some approaches for defining the AD are based on similarity analysis [12]. If the AD is regarded as a measure of the uncertainty in the prediction for a particular molecule, based on how similar that molecule is to the molecules used to build the model, then it can be defined in terms of the distances among the molecules used to build the model and the distances between the molecules of the training and external sets.
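As an illustration of the similarity-based view of the AD, the following minimal Python sketch flags a query molecule as inside or outside the domain according to its distance to the nearest training molecule. It is not the rivality or modelability index proposed in the paper. The Tanimoto distance, the representation of each molecule as a set of "on" fingerprint bits, the 0.5 threshold, and all function names are assumptions chosen only for this example.

```python
from typing import FrozenSet, Iterable

Fingerprint = FrozenSet[int]  # assumed representation: indices of the "on" bits


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / union


def nearest_training_distance(query: Fingerprint,
                              training: Iterable[Fingerprint]) -> float:
    """Distance from the query molecule to its closest training molecule."""
    return min(tanimoto_distance(query, t) for t in training)


def in_domain(query: Fingerprint,
              training: Iterable[Fingerprint],
              threshold: float = 0.5) -> bool:
    """Hypothetical AD rule: inside the domain if a training molecule is
    within `threshold` of the query. The threshold value is illustrative."""
    return nearest_training_distance(query, training) <= threshold


# Toy fingerprints (made-up bit indices, not real molecules).
training = [
    frozenset({1, 2, 3, 4}),
    frozenset({2, 3, 4, 5}),
    frozenset({10, 11, 12}),
]
similar_query = frozenset({1, 2, 3, 5})       # shares 3 of 5 bits with a neighbor
dissimilar_query = frozenset({20, 21, 22})    # shares no bits with any training item

print(in_domain(similar_query, training))     # nearest distance is 0.4
print(in_domain(dissimilar_query, training))  # nearest distance is 1.0
```

In practice the fingerprints would come from a cheminformatics toolkit, and the threshold would be calibrated on the training set rather than fixed in advance.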