Reliable estimation of externally validated prediction errors for QSAR models

Désirée Baumann, Knut Baumann

doi:10.1186/1758-2946-5-s1-p33

Open Access

Abstract

In most cases of QSAR modelling, the final model used to make predictions is not known a priori but has to be selected in a data-driven fashion (e.g. selection of principal components, variable selection, selection of the best mathematical modelling technique). Reliable estimation of externally validated prediction errors under this model uncertainty is still a challenge in chemoinformatics. To fulfil the standards of external validation, the test data set has to be independent not only of model building but also of model selection. There is still controversy in the literature about how the independent test data set should be chosen and how large it should be. There are basically two options for setting aside test data: 1) a single test data set is set aside, or 2) the test data are generated by repeatedly partitioning the available data into test and training sets, i.e. cross-validation. Since cross-validation uses the data more efficiently, it is to be preferred, in particular for small data sets. This cross-validation step must not be confused with a cross-validation step that might be necessary to select the model! If model selection is also done by cross-validation, two loops of cross-validation are necessary [1]. In the inner loop, cross-validation is employed for model selection [2] (also referred to as internal validation), while in the outer loop different test data sets are generated repeatedly and used to assess the models already selected in the inner loop (external validation). In this contribution, double cross-validation is evaluated for its ability to estimate prediction errors under model uncertainty. Depending on how double cross-validation is parameterized (test set size, number of repetitions), it yields either biased or highly variable estimates of the prediction error. The sources of bias and variability will be highlighted, and recommendations are given on how to determine the test set size in order to obtain a favourable bias-variability trade-off.
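The two loops described above can be illustrated with a minimal sketch. It is not the authors' implementation: the k-nearest-neighbour regressor, the synthetic one-dimensional data, the candidate values of k, the fold counts, and all function names are assumptions chosen only to keep the example self-contained. The inner loop selects the model (here, the number of neighbours k) using training data only; the outer loop assesses the selected model on test data that played no part in either model building or model selection.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)

def mse(train, test, k):
    """Mean squared prediction error on `test` for a model built on `train`."""
    return mean((y - knn_predict(train, x, k)) ** 2 for x, y in test)

def folds(data, n_folds):
    """Yield (train, test) partitions; each point lands in exactly one test fold."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, test

def select_k(data, candidates, n_folds):
    """Inner loop (internal validation): pick k with the lowest cross-validated MSE."""
    def cv_error(k):
        return mean(mse(tr, va, k) for tr, va in folds(data, n_folds))
    return min(candidates, key=cv_error)

def double_cv(data, candidates, outer_folds=5, inner_folds=4):
    """Outer loop (external validation): assess models selected without the test fold."""
    errors, chosen = [], []
    for train, test in folds(data, outer_folds):
        k = select_k(train, candidates, inner_folds)  # selection sees training data only
        errors.append(mse(train, test, k))            # assessment on untouched test data
        chosen.append(k)
    return mean(errors), chosen

# Synthetic data (an assumption for illustration): a noisy linear trend.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
data = [(x, 0.5 * x + rng.gauss(0, 1)) for x in xs]

estimate, chosen = double_cv(data, candidates=[1, 3, 5, 9])
print("estimated prediction error (MSE):", round(estimate, 3))
print("k selected in each outer fold:", chosen)
```

In this sketch `outer_folds` fixes the test set size (one fifth of the data per split with the default), which is the parameter the abstract identifies as governing the bias-variability trade-off. Repeating the outer loop over several random orderings of the data would correspond to the number of repetitions.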

Journal: Journal of Cheminformatics
Publication Date: Mar 1, 2013
Citations: 1
License type: CC BY 2.0
