Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction.

Anne-Laure Boulesteix, Carolin Strobl

doi:10.1186/1471-2288-9-85

Abstract

Background: In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.

Methods: In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure.

Results: We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.

Conclusions: The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy to present only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
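
The selection effect described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' setup: it uses simulated Gaussian noise rather than permuted microarray data, a single nearest-centroid classifier, and only seven variants that differ in the number of selected predictors (instead of 124 variants). The names `cv_error` and `simulate`, and all parameter values, are assumptions made for illustration.

```python
import random
import statistics


def cv_error(X, y, n_genes, folds=5):
    """Cross-validated error rate of a nearest-centroid classifier.

    The n_genes predictors with the largest absolute difference in class
    means are selected inside each training fold, then used for prediction.
    Assumes both classes are present in every training fold.
    """
    n, p = len(X), len(X[0])
    wrong = 0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        # Class centroids are computed on the training fold only.
        centroids = []
        for c in (0, 1):
            rows = [X[i] for i in train if y[i] == c]
            centroids.append([sum(r[j] for r in rows) / len(rows) for j in range(p)])
        # Rank predictors by absolute mean difference and keep the top n_genes.
        ranked = sorted(range(p), key=lambda j: -abs(centroids[0][j] - centroids[1][j]))
        keep = ranked[:n_genes]
        for i in test:
            d0 = sum((X[i][j] - centroids[0][j]) ** 2 for j in keep)
            d1 = sum((X[i][j] - centroids[1][j]) ** 2 for j in keep)
            predicted = 0 if d0 <= d1 else 1
            wrong += predicted != y[i]
    return wrong / n


def simulate(n_reps=15, n=40, p=100, seed=0):
    """Median minimal error over variants vs. median error of one fixed variant."""
    rng = random.Random(seed)
    grid = [1, 2, 5, 10, 20, 50, 100]  # hypothetical numbers of selected predictors
    min_errors, fixed_errors = [], []
    for _ in range(n_reps):
        # Non-informative predictors: pure noise, unrelated to the class labels.
        X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        y = [0, 1] * (n // 2)  # balanced classes
        rng.shuffle(y)
        errors = [cv_error(X, y, k) for k in grid]
        min_errors.append(min(errors))               # variant chosen a posteriori
        fixed_errors.append(errors[grid.index(10)])  # variant fixed in advance
    return statistics.median(min_errors), statistics.median(fixed_errors)


med_min, med_fixed = simulate()
print(f"median minimal error over variants:        {med_min:.2f}")
print(f"median error of one pre-specified variant: {med_fixed:.2f}")
```

Because the minimum is taken over error estimates computed on the same data, it can never exceed the error of a variant fixed in advance. With balanced classes and uninformative predictors the true error rate is 50%, so a minimal error clearly below that level reflects selection rather than signal.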

Highlights

In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results.
In univariate analyses for identifying differentially expressed genes, the multiple testing problem resulting from high dimensionality can be addressed, e.g. by means of approaches based on the false discovery rate [4,5].
Note that the results from different studies are difficult to compare, since they are all based on different evaluation designs and different variable selection approaches.

Summary

Introduction

Researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are exposed to this kind of bias. It is well known that almost all published studies present positive research results, as outlined by Kyzas et al. [1] for the special case of prostate cancer.


Journal: BMC Medical Research Methodology
Publication Date: Dec 1, 2009
Citations: 98
License type: CC BY 2.0
