Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups

Jiexin Zhang, Kevin R Coombes

doi:10.1186/1471-2105-13-s13-s1

Open Access

Abstract

Background: High-throughput technologies enable the testing of tens of thousands of measurements simultaneously. Identification of genes that are differentially expressed or associated with clinical outcomes invokes the multiple testing problem. False Discovery Rate (FDR) control is a statistical method used to correct for multiple comparisons for independent or weakly dependent test statistics. Although FDR control is frequently applied to microarray data analysis, gene expression is usually correlated, which might lead to inaccurate estimates. In this paper, we evaluate the accuracy of FDR estimation.

Methods: Using two real data sets, we resampled subgroups of patients and recalculated statistics of interest to illustrate the imprecision of FDR estimation. Next, we generated many simulated data sets with block correlation structures and realistic noise parameters, using the Ultimate Microarray Prediction, Inference, and Reality Engine (UMPIRE) R package. We estimated FDR using a beta-uniform mixture (BUM) model, and examined the variation in FDR estimation.

Results: The three major sources of variation in FDR estimation are the sample size, correlations among genes, and the true proportion of differentially expressed genes (DEGs). The sample size and proportion of DEGs affect both magnitude and precision of FDR estimation, while the correlation structure mainly affects the variation of the estimated parameters.

Conclusions: We have decomposed various factors that affect FDR estimation, and illustrated the direction and extent of the impact. We found that the proportion of DEGs has a significant impact on FDR; this factor might have been overlooked in previous studies and deserves more thought when controlling FDR.
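To make the BUM-based FDR estimate mentioned in the Methods concrete, here is a minimal Python sketch. It assumes the standard beta-uniform mixture density f(p) = λ + (1 − λ)·a·p^(a−1) with 0 < λ < 1 and 0 < a < 1, and the usual plug-in estimate FDR(τ) = π_ub·τ / F(τ), where π_ub = λ + (1 − λ)·a is the upper bound on the null proportion and F is the fitted cumulative distribution. The function names, the coarse grid-search fit, and the simulated p-values are illustrative assumptions; this is not the authors' code, which the abstract does not show.

```python
import math
import random


def simulate_p_values(n_genes, null_fraction, alpha, rng):
    """Draw p-values: uniform for null genes, Beta(alpha, 1) for the others."""
    p_values = []
    for _ in range(n_genes):
        u = 1.0 - rng.random()  # lies in (0, 1], so log(p) is always defined
        if rng.random() < null_fraction:
            p_values.append(u)
        else:
            p_values.append(u ** (1.0 / alpha))  # inverse-CDF draw from Beta(alpha, 1)
    return p_values


def fit_bum(p_values, step=0.02):
    """Coarse grid-search maximum likelihood fit of (lam, a); illustrative only."""
    log_p = [math.log(p) for p in p_values]
    n_steps = int(round(1.0 / step))
    best_ll, best_lam, best_a = -math.inf, None, None
    for i in range(1, n_steps):
        lam = i * step
        for j in range(1, n_steps):
            a = j * step
            # log-likelihood of the density lam + (1 - lam) * a * p**(a - 1)
            ll = sum(
                math.log(lam + (1.0 - lam) * a * math.exp((a - 1.0) * lp))
                for lp in log_p
            )
            if ll > best_ll:
                best_ll, best_lam, best_a = ll, lam, a
    return best_lam, best_a


def bum_fdr(tau, lam, a):
    """Plug-in FDR estimate when calling every p-value below tau significant."""
    pi_ub = lam + (1.0 - lam) * a                 # upper bound on the null proportion
    cdf = lam * tau + (1.0 - lam) * tau ** a      # fitted P(p <= tau)
    return pi_ub * tau / cdf


rng = random.Random(2012)
p_values = simulate_p_values(n_genes=1000, null_fraction=0.8, alpha=0.2, rng=rng)
lam_hat, a_hat = fit_bum(p_values)
pi_ub = lam_hat + (1.0 - lam_hat) * a_hat
print("lambda:", lam_hat, "a:", a_hat, "pi_ub:", round(pi_ub, 3))
for tau in (0.001, 0.01, 0.05):
    print("cutoff", tau, "estimated FDR", round(bum_fdr(tau, lam_hat, a_hat), 3))
```

Rerunning the sketch with a different `null_fraction` or a smaller `n_genes` gives a feel for how the estimated FDR at a fixed cutoff moves with the proportion of DEGs, which is the factor the Conclusions single out.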

Highlights

With the advent of high-throughput technologies, research has focused on the systematic genome-wide study of biological systems.
Methods that estimate the FDR for gene lists share certain characteristics: they perform a separate statistical test for each gene or protein; they compute a p-value associated with each test; and they estimate the False Discovery Rate (FDR) using the distribution of p-values.
Through a comprehensive set of simulations, we show that sample size, correlation structure, and proportion of differentially expressed genes (DEGs) account for the majority of observed variability in the p-value distributions and FDR estimates found in real data.

Summary

Introduction

With the advent of high-throughput technologies, research has focused on the systematic genome-wide study of biological systems. New statistical methods have been developed to analyze the data generated by these experiments. These methods involve both data preprocessing (background correction, data transformation, normalization, etc.) and specific tools for different types of studies (e.g., class discovery, class prediction, or class comparison). The canonical class comparison problem involves the identification of lists of DEGs. The evolving consensus [1] on the analysis of microarray data recognizes the centrality of methods that estimate the FDR associated with gene lists. False Discovery Rate (FDR) control is a statistical method used to correct for multiple comparisons for independent or weakly dependent test statistics. Although FDR control is frequently applied to microarray data analysis, gene expression is usually correlated, which might lead to inaccurate estimates. We evaluate the accuracy of FDR estimation.
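The concern about correlated gene expression can be illustrated with a small simulation. The Python sketch below is a toy stand-in, not the UMPIRE-based simulation described in the abstract: it assumes no gene is truly differentially expressed, unit-variance Gaussian noise, a z-test with known variance, and blocks of genes that share a common factor so that genes within a block have correlation `rho`. All names and parameter values are hypothetical. It compares how much the number of false positives varies from one data set to the next with and without block correlation.

```python
import math
import random
from statistics import NormalDist, mean, pvariance


def count_rejections(n_genes, block_size, rho, n_per_group, cutoff, rng):
    """Simulate one all-null two-group data set; count genes with p < cutoff."""
    std_normal = NormalDist()
    shared_w = math.sqrt(rho)        # weight of the factor shared within a block
    own_w = math.sqrt(1.0 - rho)     # weight of the gene-specific noise
    diff_sums = [0.0] * n_genes      # (sum over group 1) - (sum over group 2)
    for sign in (1.0, -1.0):
        for _ in range(n_per_group):
            for start in range(0, n_genes, block_size):
                shared = rng.gauss(0.0, 1.0)
                for g in range(start, min(start + block_size, n_genes)):
                    value = shared_w * shared + own_w * rng.gauss(0.0, 1.0)
                    diff_sums[g] += sign * value
    scale = math.sqrt(2.0 * n_per_group)  # sd of the difference of sums under the null
    rejections = 0
    for s in diff_sums:
        z = s / scale
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-sided z-test p-value
        if p < cutoff:
            rejections += 1
    return rejections


def replicate(rho, n_replicates, rng):
    """Repeat the simulation and return the rejection count of each replicate."""
    return [
        count_rejections(n_genes=200, block_size=20, rho=rho,
                         n_per_group=5, cutoff=0.05, rng=rng)
        for _ in range(n_replicates)
    ]


rng = random.Random(13)
independent = replicate(rho=0.0, n_replicates=100, rng=rng)
correlated = replicate(rho=0.8, n_replicates=100, rng=rng)
var_indep = pvariance(independent)
var_corr = pvariance(correlated)
print("independent: mean", mean(independent), "variance", round(var_indep, 2))
print("block-correlated: mean", mean(correlated), "variance", round(var_corr, 2))
```

In this toy setting the expected number of false positives is the same in both cases, but the spread across replicates is wider under block correlation, which is in line with the abstract's statement that correlation mainly affects the variation of the estimates.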

Journal: BMC Bioinformatics	Publication Date: Aug 1, 2012
Citations: 42	License type: CC BY 2.0
