Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation.

Désirée Baumann, Knut Baumann

doi:10.1186/s13321-014-0047-1

Open Access

https://doi.org/10.1186/s13321-014-0047-1

Abstract

Background: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging – especially under model uncertainty – and requires independent test objects. These test objects must be involved neither in model building nor in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive possibility to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, there is a controversy in the literature with respect to the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection.

Methods: Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided.

Results: The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate.

Conclusions: Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. As compared to a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred over a single test set.

Electronic supplementary material: The online version of this article (doi:10.1186/s13321-014-0047-1) contains supplementary material, which is available to authorized users.

Highlights

QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model
Since the main emphasis was on the comparison of MLR and principal component regression (PCR) for different cross-validation techniques, the results of Lasso are only briefly analysed
The composition of the prediction error was first studied by decomposing it into bias and variance terms (ave.bias(ME) and ave.var(ME)) as described previously
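
The decomposition mentioned in the last highlight can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes that ME denotes the squared deviation of a prediction from the noise-free response, and it uses a deliberately misspecified model (a straight line fitted to a quadratic truth) so that the bias term is visibly non-zero. The function names, the toy response and all parameter values are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def decompose(true_f, x_test, n_train=20, n_rep=200, noise_sd=0.5, seed=0):
    """Average squared bias, variance and model error over the test points.

    Assumption: model error is measured against the noise-free value
    true_f(x), which is only known because the data are simulated.
    """
    rng = random.Random(seed)
    preds = [[] for _ in x_test]  # one list of predictions per test point
    for _ in range(n_rep):
        # Draw a fresh training set and refit the model.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        a, b = fit_line(xs, ys)
        for p, x in zip(preds, x_test):
            p.append(a + b * x)

    bias2, var, me = [], [], []
    for p, x in zip(preds, x_test):
        centre = mean(p)  # average prediction over all training sets
        bias2.append((centre - true_f(x)) ** 2)
        var.append(mean((v - centre) ** 2 for v in p))
        me.append(mean((v - true_f(x)) ** 2 for v in p))
    return mean(me), mean(bias2), mean(var)


x_test = [-1.0, -0.5, 0.0, 0.5, 1.0]
me, bias2, var = decompose(lambda x: x * x, x_test)
print(f"ME={me:.4f}  bias^2={bias2:.4f}  variance={var:.4f}")
```

In this sketch the averaged model error equals the sum of the averaged squared bias and the averaged variance up to floating-point rounding, which is the identity that such a decomposition relies on.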

Summary

Introduction

QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Reliable estimation of prediction errors is challenging – especially under model uncertainty – and requires independent test objects. These test objects must be involved neither in model building nor in model selection. Double cross-validation, sometimes termed nested cross-validation, offers an attractive possibility to generate test data and to select QSAR models since it uses the data very efficiently. The challenge is to distinguish between relevant descriptors, which directly relate to the biological activity, and irrelevant descriptors [2]. This requires both a selection step and a validation step; the latter is necessary to be able to estimate the prediction error unbiasedly. The terms double cross-validation and nested cross-validation are used synonymously.
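
The separation between model selection and error estimation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the setup used in the paper: variable selection is reduced to picking the single best descriptor for a one-variable linear model, both loops use plain k-fold splits, and the toy data, function names and parameter values are all hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def split_folds(indices, k, rng):
    """Shuffle the indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[f::k] for f in range(k)]


def fit_on(j, X, y, indices):
    """Fit the one-descriptor model for descriptor j on the given objects."""
    return fit_line([X[i][j] for i in indices], [y[i] for i in indices])


def squared_errors(model, j, X, y, indices):
    a, b = model
    return [(y[i] - (a + b * X[i][j])) ** 2 for i in indices]


def select_descriptor(X, y, train, k_inner, rng):
    """Inner loop: choose the descriptor with the lowest cross-validated
    error, looking only at the objects listed in `train`."""
    folds = split_folds(train, k_inner, rng)
    cv_error = []
    for j in range(len(X[0])):
        errs = []
        for validation in folds:
            held_out = set(validation)
            construction = [i for i in train if i not in held_out]
            model = fit_on(j, X, y, construction)
            errs.extend(squared_errors(model, j, X, y, validation))
        cv_error.append(mean(errs))
    return min(range(len(cv_error)), key=cv_error.__getitem__)


def double_cv(X, y, k_outer=5, k_inner=5, seed=0):
    """Outer loop: estimate the prediction error of the whole procedure
    (selection plus fitting) on objects never used for either step."""
    rng = random.Random(seed)
    everything = list(range(len(y)))
    errs = []
    for test in split_folds(everything, k_outer, rng):
        held_out = set(test)
        train = [i for i in everything if i not in held_out]
        j = select_descriptor(X, y, train, k_inner, rng)  # selection sees train only
        model = fit_on(j, X, y, train)
        errs.extend(squared_errors(model, j, X, y, test))
    return mean(errs)


# Toy data (assumption): 10 descriptors, only descriptor 0 carries signal.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(60)]
y = [2.0 * row[0] + data_rng.gauss(0, 0.5) for row in X]
pe_estimate = double_cv(X, y)
print(f"double CV estimate of the mean squared prediction error: {pe_estimate:.3f}")
```

The point of the design is that `select_descriptor` is called inside the outer loop, so each outer test fold stays untouched by both fitting and selection. Selecting the descriptor once on all data and cross-validating afterwards would let the test objects influence model selection.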

Journal: Journal of Cheminformatics
Publication Date: Nov 26, 2014
Citations: 177
License type: CC BY 4.0
