Abstract

BackgroundModern high-throughput measurement technologies such as DNA microarrays and next generation sequencers produce extensive datasets. With large datasets the emphasis has been moving from traditional statistical tests to new data mining methods that are capable of detecting complex patterns, such as clusters, regulatory networks, or time series periodicity. Study of periodic gene expression is an interesting research question that also is a good example of challenges involved in the analysis of high-throughput data in general. Unlike for classical statistical tests, the distribution of test statistic for data mining methods cannot be derived analytically.ResultsWe describe the randomization based approach to significance testing, and show how it can be applied to detect periodically expressed genes. We present four randomization methods, three of which have previously been used for gene cycle data. We propose a new method for testing significance of periodicity in gene expression short time series data, such as from gene cycle and circadian clock studies. We argue that the underlying assumptions behind existing significance testing approaches are problematic and some of them unrealistic. We analyze the theoretical properties of the existing and proposed methods, showing how our method can be robustly used to detect genes with exceptionally high periodicity. We also demonstrate the large differences in the number of significant results depending on the chosen randomization methods and parameters of the testing framework.By reanalyzing gene cycle data from various sources, we show how previous estimates on the number of gene cycle controlled genes are not supported by the data. Our randomization approach combined with widely adopted Benjamini-Hochberg multiple testing method yields better predictive power and produces more accurate null distributions than previous methods.ConclusionsExisting methods for testing significance of periodic gene expression patterns are simplistic and optimistic. Our testing framework allows strict levels of statistical significance with more realistic underlying assumptions, without losing predictive power. As DNA microarrays have now become mainstream and new high-throughput methods are rapidly being adopted, we argue that not only there will be need for data mining methods capable of coping with immense datasets, but there will also be need for solid methods for significance testing.

Highlights

  • Modern high-throughput measurement technologies such as DNA microarrays and generation sequencers produce extensive datasets

  • Randomization methods are techniques for significance testing that are based on generating data that shares some of the same properties with the real data, but lacks the structure of interest

  • When gene expression data is mined for complex patterns, little if no effort is made to test results for statistical significance

Read more

Summary

Introduction

Modern high-throughput measurement technologies such as DNA microarrays and generation sequencers produce extensive datasets. Study of periodic gene expression is an interesting research question that is a good example of challenges involved in the analysis of high-throughput data in general. A randomization method is based (explicitly or implicitly) on a null model, i.e., a description of what the data would look like in the absence of the pattern of interest. The null model states that the data looks like the original data, except that the target variable is random (but has the same distribution of values as the original one). Using the null model to maintain the number of 1s in the columns and rows in significance testing tells whether the data analysis result is caused just by the row and column sums, i.e., the count of differential expression values for genes and samples

Methods
Results
Discussion
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.