Abstract

Background

When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated.

Results

We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For the error rate this bias is only severe in quite restricted situations, but it can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show that these biases exist in practice.

Conclusion

Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
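To make the pooled versus per-fold AUC comparison concrete, the following is a minimal sketch assuming scikit-learn; the random no-signal dataset, the logistic regression classifier and the fold counts are illustrative assumptions rather than the study's actual setup. It computes the AUC both as the average of per-fold estimates and by pooling the test scores across folds, under unstratified and stratified 10-fold cross-validation.

# Minimal sketch (assumed scikit-learn setup, not the study's code):
# compare per-fold-averaged AUC with pooled AUC under unstratified and
# stratified 10-fold CV on a random, no-signal dataset.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))        # small-sample, high-dimensional, no signal
y = np.array([0] * 30 + [1] * 30)     # balanced labels, independent of X

def cv_auc(splitter):
    fold_aucs, pooled_scores, pooled_labels = [], [], []
    for train, test in splitter.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        scores = clf.decision_function(X[test])
        if len(np.unique(y[test])) == 2:          # skip degenerate one-class folds
            fold_aucs.append(roc_auc_score(y[test], scores))
        pooled_scores.extend(scores)
        pooled_labels.extend(y[test])
    return np.mean(fold_aucs), roc_auc_score(pooled_labels, pooled_scores)

print("unstratified (per-fold, pooled):", cv_auc(KFold(10, shuffle=True, random_state=0)))
print("stratified   (per-fold, pooled):", cv_auc(StratifiedKFold(10, shuffle=True, random_state=0)))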

Highlights

  • When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases

  • Analysis of bias in error rate and area under the ROC (AUC) estimation: because CV uses sampling without replacement to partition the dataset into training and test sets, any deviation from the class proportions of the whole dataset in a training set leads to an opposite deviation in the corresponding test set (see the sketch after this list)

  • We show that common sample-reuse validation schemes such as CV and bootstrap can lead to large pessimistic biases due to correlated class proportions between training and test sets
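The mechanism behind the second highlight can be seen directly in a small simulation; the sketch below is an assumed illustration (NumPy and scikit-learn), not code from the paper. Because every sample left out of a training set ends up in the corresponding test set, a surplus of one class in training is mirrored by a deficit in testing, so the two class proportions are negatively correlated.

# Illustrative sketch (assumed setup, not from the paper): measure the
# correlation between training-set and test-set class proportions under
# unstratified 10-fold CV on a balanced binary dataset.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
y = np.array([0] * 30 + [1] * 30)        # balanced binary labels

train_props, test_props = [], []
for _ in range(200):                      # repeat CV with different shuffles
    kf = KFold(n_splits=10, shuffle=True, random_state=int(rng.integers(1 << 31)))
    for train, test in kf.split(y):
        train_props.append(y[train].mean())
        test_props.append(y[test].mean())

# Prints a strongly negative value (equal fold sizes make it exactly -1).
print("correlation:", np.corrcoef(train_props, test_props)[0, 1])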



Introduction

When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. Ambroise and McLachlan [8] and Simon et al. [1] demonstrated an optimistic selection bias that occurs when gene selection is done using the entire dataset rather than separately for each resampled training set. This bias arises through the incorporation of information from the test sets into the training of the classifier. Varma and Simon [9] demonstrated in a simulation study the optimistic hyperparameter selection bias [10], which occurs when reporting the best error rates achieved on the validation set used to tune classifier (hyper)parameters, rather than using a nested cross-validation (CV) or a separate test set to evaluate the classifier.
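The remedies referenced here, per-fold gene selection and nested cross-validation for hyperparameter tuning, can be sketched as follows; this is an assumed scikit-learn illustration with invented data and parameter choices, not the procedure from the cited studies.

# Minimal sketch (assumed scikit-learn setup, not the cited studies' code):
# keep gene selection and hyperparameter tuning inside each training fold so
# that no information from the held-out test folds leaks into training.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))           # illustrative expression matrix
y = np.array([0] * 30 + [1] * 30)         # illustrative balanced class labels

# Selection and classification wrapped in one estimator, refit per training fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=50)),
                 ("clf", LinearSVC(C=1.0, dual=True, max_iter=10000))])

# Inner CV tunes C; the outer CV reports performance (nested CV).
inner = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]},
                     cv=StratifiedKFold(5, shuffle=True, random_state=0))
outer_scores = cross_val_score(inner, X, y,
                               cv=StratifiedKFold(5, shuffle=True, random_state=1))
print("nested CV accuracy:", outer_scores.mean())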
