Abstract

Model validation is the most important part of building a supervised model. Building a model with good generalization performance requires a sensible data splitting strategy, which is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Partial least squares for discriminant analysis and support vector machines for classification were then applied to these datasets. The data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and the sample set partitioning based on joint X–Y distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the dataset is the deciding factor for the quality of the generalization performance estimated from the validation set. We found that there was a significant gap between the performance estimated from the validation set and that obtained from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models then moved towards approximations of the central limit theorem for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimate of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather poorly representative sample set for model performance estimation.
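For readers unfamiliar with the systematic sampling schemes mentioned above, the sketch below illustrates the Kennard-Stone (K-S) max-min selection rule; SPXY follows the same scheme but computes distances jointly in X and Y space. This is an illustrative NumPy implementation written for this summary, not code from the study, and the function name and split sizes are assumptions.

import numpy as np

def kennard_stone_split(X, n_train):
    """Illustrative K-S split: returns (train_idx, val_idx).

    The canonical rule seeds the training set with the two most distant
    samples, then repeatedly adds the sample whose nearest already-selected
    neighbour is farthest away (max-min distance).
    """
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining sample to its nearest selected sample.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return np.array(selected), np.array(remaining)

# Example: 30 simulated samples, 20 taken for training, 10 left for validation.
X = np.random.default_rng(0).normal(size=(30, 5))
train_idx, val_idx = kennard_stone_split(X, n_train=20)

Because the most representative (most spread-out) samples are taken first, the samples left for the validation set tend to sit in the interior of the data cloud, which is consistent with the poor performance estimates reported above for K-S and SPXY.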

Highlights

  • Supervised learning, which is used for sample classification from chemical data, is a very common task in chemometrics studies

  • The correct classification rates (CCRs) of all the simulations are provided in an Excel spreadsheet named “results_summary.xlsx” as electronic supplementary material (ESM)

  • On small datasets with only 30 samples available, the CCRs of the validation sets varied very significantly and the CCRs on the test sets were evidently low


Introduction

Supervised learning, which is used for sample classification from (bio)chemical data, is a very common task in chemometrics studies. Harrington et al. [3] demonstrated that a single split into training and test sets can provide an erroneous estimate of model performance. These studies highlight the importance of having an additional blind test set, not used during the model selection and validation process, to obtain a better estimate of the generalization performance of the model. Even following this procedure (Fig. 1), it is still impossible to tell how well the estimated predictive performance of the model on the blind test set matches the true underlying distribution of the data. The estimated performance of the model is likely to be affected by many factors, such as the modelling algorithm, the overlap in the data, the number of samples available for training and, perhaps most importantly, the method used for splitting the data.
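As a rough illustration of the procedure in Fig. 1, the sketch below separates a blind test set before any model selection takes place and compares the cross-validated (validation) estimate with the blind-test estimate. It uses scikit-learn with simulated Gaussian class data as a stand-in for the MixSim datasets; the SVM classifier, parameter grid and sample sizes are arbitrary choices for this example, not the settings used in the study.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Simulated two-class data (a stand-in for the MixSim datasets).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=2, random_state=0)

# Hold out a blind test set that plays no part in model selection/validation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Model selection via cross-validation on the training/validation data only.
search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                      cv=5)
search.fit(X_dev, y_dev)

# The gap between these two numbers is the quantity the study examines.
print("validation CCR (cross-validation):", round(search.best_score_, 3))
print("blind test CCR:", round(search.score(X_test, y_test), 3))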

