Classification based upon gene expression data: bias and precision of error rates

Ian A Wood, Peter M Visscher, Kerrie L Mengersen

doi:10.1093/bioinformatics/btm117

Abstract

Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean. Reporting an optimal estimated error rate incurs an optimization bias. A downward bias of 3-5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates and, where possible, the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers that ignore the predictors. R code that implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp
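
The error measures recommended in the abstract can be illustrated with a minimal sketch. It is not the authors' R code: the function names, the example confusion matrix, the priors and the cost matrix are all hypothetical, and the conditional risk is computed with the standard expected-cost definition (prior of each true class, times the cost of each decision, times the estimated probability of that decision given the true class), which is assumed here rather than taken from the paper.

```python
from typing import List, Sequence

# Assumed layout: confusion[i][j] = number of samples whose true class is i
# and whose predicted class is j. All numbers below are made up.


def class_error_rates(confusion: Sequence[Sequence[int]]) -> List[float]:
    """Per-class error rate: fraction of each true class that is misclassified."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append((total - row[i]) / total)
    return rates


def overall_error_rate(confusion: Sequence[Sequence[int]]) -> float:
    """Simple overall error rate: misclassified samples over all samples."""
    n = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (n - correct) / n


def conditional_risk(
    confusion: Sequence[Sequence[int]],
    priors: Sequence[float],
    costs: Sequence[Sequence[float]],
) -> float:
    """Expected misclassification cost.

    priors[i]   : assumed prior probability of true class i
    costs[i][j] : assumed cost of predicting class j when the truth is class i
    """
    risk = 0.0
    for i, row in enumerate(confusion):
        total = sum(row)
        for j, count in enumerate(row):
            # P(predict j | true i) is estimated by count / total
            risk += priors[i] * costs[i][j] * count / total
    return risk


# Hypothetical two-class example: row 0 = non-diseased, row 1 = diseased.
confusion = [[40, 10],
             [5, 45]]

per_class = class_error_rates(confusion)   # 10/50 and 5/50
overall = overall_error_rate(confusion)    # 15/100

# Assumed population priors and an assumed cost matrix in which missing a
# diseased sample costs five times as much as a false alarm.
priors = [0.9, 0.1]
costs = [[0.0, 1.0],
         [5.0, 0.0]]
risk = conditional_risk(confusion, priors, costs)

print(per_class, overall, risk)
```

With 0-1 costs and priors equal to the sample class proportions, the conditional risk reduces to the overall error rate. The two measures diverge once the priors or the costs are unequal, which is why reporting class error rates alongside the overall rate is useful.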


Journal: Bioinformatics
Publication Date: Mar 28, 2007
Citations: 88

