Abstract

Background

Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development.

Methodology/Principal Findings

We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data.

Conclusions/Significance

The findings of the present study have two important practical implications. First, by avoiding under-powered data analysis protocols, high-throughput studies can achieve substantial economies in the sample size required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Far smaller samples than previously thought may be sufficient for exploratory studies, as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.

Highlights

  • We start with a theoretical analysis showing how the choice of four specific components of data analysis protocols for molecular signature development and statistical testing (error metric, classifier, error estimator and event balancing) affects the statistical power to detect predictive signal.

  • We present a simulation study demonstrating that, depending on the choice of these components, even strong signals can fail to be detected with routine sample sizes, and that the effects of each component on statistical power are large and compounding.

Introduction

Microarrays and other high-throughput assaying technologies have generated immense opportunities for discovery spanning the spectrum from basic research to clinical studies [1,2,3]. Developing molecular signatures, in particular, is playing an increasingly important role in a variety of research design objectives in both basic and translational studies. Such objectives include, for example, detecting complex and coordinated patterns of transcriptional response to chemotherapeutic agents on cell lines and predicting subsequent patient treatment response on the basis of this information [5], discovery of new drug targets [6], discovery of biomarkers [7], subtyping diseases [8] and personalizing treatments [9]. Essential to developing molecular signatures are assay reproducibility and statistical reproducibility. The latter can be directly assessed by tests of statistical significance of the produced signatures. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development.
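
To make the notion of "statistical power to detect predictive signal" concrete, the following is a minimal sketch of how power can be estimated by simulation for a label-permutation test of a signature's predictive performance. It is not the protocol used in the study: the Gaussian score model, the median-threshold accuracy metric, and all function names (`auc`, `accuracy`, `permutation_p_value`, `estimate_power`) are assumptions introduced here for illustration only.

```python
import random
import statistics


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels):
    """Proportion correct when predicting class 1 for scores above the median.

    The median threshold is an illustrative assumption, not a recommendation.
    """
    threshold = statistics.median(scores)
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


def permutation_p_value(metric, scores, labels, n_perm, rng):
    """One-sided p-value: how often shuffled labels match or beat the observed metric."""
    observed = metric(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling preserves the class counts
        if metric(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def estimate_power(metric, n_pos, n_neg, shift,
                   n_trials=100, n_perm=99, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which the permutation test rejects.

    Hypothetical data model: scores are N(shift, 1) for class 1 and N(0, 1)
    for class 0, so shift = 0 corresponds to no predictive signal.
    """
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    rejections = 0
    for _ in range(n_trials):
        scores = [rng.gauss(shift if y == 1 else 0.0, 1.0) for y in labels]
        if permutation_p_value(metric, scores, labels, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_trials
```

Calling, for example, `estimate_power(auc, 10, 10, shift=1.0)` and `estimate_power(accuracy, 5, 15, shift=1.0)` lets one vary the error metric and the class balance separately and observe how the estimated power changes. The specific numbers depend entirely on the assumed data model and should not be read as results from the paper.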
