Abstract

Background

Genomic technologies are, by their very nature, designed for hypothesis generation. In some cases, the hypotheses that are generated require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. This has led researchers in genomics to adopt the strategy of confirming only a handful of the most statistically significant results, a small subset chosen for biological interest, or a small random subset. But there is no standard approach for selecting and quantitatively evaluating validation targets.

Results

Here we present a new statistical method and approach for statistically validating lists of significant results based on confirming only a small random sample. We apply our statistical method to show that the usual practice of confirming only the most statistically significant results does not statistically validate result lists. We analyze an extensively validated RNA-sequencing experiment to show that confirming a random subset can statistically validate entire lists of significant results. Finally, we analyze multiple publicly available microarray experiments to show that statistically validating random samples can both (i) provide evidence to confirm long gene lists and (ii) save thousands of dollars and hundreds of hours of labor over manual validation of each significant result.

Conclusions

For high-throughput -omics studies, statistical validation is a cost-effective and statistically valid approach to confirming lists of significant results.
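The idea of validating a list from a small random sample can be illustrated with a short sketch. The abstract does not spell out the authors' method, so the code below is only an assumption-laden illustration: it draws a simple random sample of significant results, then treats the number confirmed as binomial with a uniform Beta(1, 1) prior on the proportion of true positives in the full list. The function names `draw_validation_sample` and `prob_true_positive_rate_exceeds` are hypothetical.

```python
import random
from math import comb


def draw_validation_sample(significant_ids, n, seed=None):
    """Draw a simple random sample (without replacement) of n results
    from the full list of significant results."""
    rng = random.Random(seed)
    return rng.sample(list(significant_ids), n)


def prob_true_positive_rate_exceeds(confirmed, n, threshold):
    """Posterior probability that the true-positive proportion of the
    whole list exceeds `threshold`, given `confirmed` of `n` randomly
    sampled results were confirmed.

    Assumption (not from the source): binomial likelihood with a
    uniform Beta(1, 1) prior, so the posterior is
    Beta(confirmed + 1, n - confirmed + 1). For integer parameters the
    Beta upper tail equals a binomial lower tail:
    P(p > t) = P(Binomial(n + 1, t) <= confirmed).
    """
    if not 0 <= confirmed <= n:
        raise ValueError("confirmed must be between 0 and n")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    m = n + 1
    return sum(
        comb(m, j) * threshold**j * (1 - threshold) ** (m - j)
        for j in range(confirmed + 1)
    )


if __name__ == "__main__":
    # Hypothetical list of 500 significant genes; validate 20 at random.
    genes = [f"gene_{i}" for i in range(500)]
    targets = draw_validation_sample(genes, 20, seed=1)
    # Suppose 18 of the 20 sampled genes are confirmed in the lab.
    print(len(targets), prob_true_positive_rate_exceeds(18, 20, 0.75))
```

Because the sample is random, the confirmation rate in the sample is informative about the whole list. The same calculation applied to only the top-ranked results would not support that inference, since those results are not representative of the list.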

Highlights

  • Genomic technologies are, by their very nature, designed for hypothesis generation

  • Non-random validation cannot be used to confirm a complete list of significant results. Manually confirming only the most significant results (Figure 1) is probably the most common validation strategy in genomic studies

  • The implicit assumption was made that validating the most significant results was sufficient to support the accuracy of a statistical method or an entire list of significant results [21,22]

Summary

Introduction

In some cases, the hypotheses generated by genomic technologies require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. Technologies such as microarrays [1] and next-generation sequencing [2] are routinely used to measure thousands or millions of variables for each sample in a study. A much smaller number of results are manually validated, typically those with the most significant p-values, using an independent validation technology (Figure 1). One goal of manual validation is to confirm specific biological findings. It may be of interest to confirm that a specific SNP is associated with a complex phenotype or to
