Abstract

Background

Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. In addition, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power and/or an inflated false positive rate.

Results

We introduce a multiple testing-based methodology that makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are also significant.

Conclusions

The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family-wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although, for simplicity of exposition, the methodology is described for microarray gene expression data, it is also applicable to any high-dimensional data, such as mRNA-seq data, CpG methylation data, etc.

Highlights

  • Based on available biological information, genomic data can often be partitioned into pre-defined sets and subsets within sets

  • Although identification of differentially expressed individual variables across experimental conditions is of general interest, in recent years there has been considerable interest in analyzing sets of variables that belong to some pre-specified biological categories, such as signaling pathways and biological functions

  • To understand the robustness of the two methods in terms of family-wise error rate (FWER) control, we considered a variety of probability distributions for the gene expression, as follows: (1) Multivariate normal distribution, of appropriate dimension, with mean vectors 0 and μ, and covariance matrices Σ1 and Σ2, respectively
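
The simulation setting in the last highlight can be illustrated with a minimal sketch. It draws two groups from multivariate normal distributions with mean vectors 0 and μ and two different covariance matrices, then estimates the empirical FWER under the null (μ = 0). Everything specific here is an assumption made for illustration: the function names, the exchangeable covariance structure, the sample sizes, the three hypothetical gene sets, and above all the test itself, which is a simple Bonferroni-adjusted comparison of per-sample set averages with a normal approximation. It is a stand-in, not the methodology proposed in the paper.

```python
import math
import numpy as np


def exchangeable_cov(p, rho):
    """Unit-variance covariance with common correlation rho (illustrative choice)."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))


def simulate_two_groups(rng, n1, n2, mu, sigma1, sigma2):
    """Group 1 ~ N(0, sigma1), group 2 ~ N(mu, sigma2); rows are samples."""
    p = len(mu)
    x = rng.multivariate_normal(np.zeros(p), sigma1, size=n1)
    y = rng.multivariate_normal(np.asarray(mu, dtype=float), sigma2, size=n2)
    return x, y


def bonferroni_set_test(x, y, sets, alpha=0.05):
    """Stand-in set-level test (NOT the paper's method).

    Each sample is summarised by its average over the genes in a set; the two
    groups are compared with a Welch-type z statistic and a normal-approximation
    p-value, then Bonferroni-adjusted across sets.
    """
    rejections = []
    for idx in sets:
        a = x[:, idx].mean(axis=1)
        b = y[:, idx].mean(axis=1)
        se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        z = abs(a.mean() - b.mean()) / se
        p_value = math.erfc(z / math.sqrt(2.0))  # two-sided normal p-value
        rejections.append(p_value < alpha / len(sets))
    return np.array(rejections)


def empirical_fwer(n_rep=200, n=30, rho=0.5, alpha=0.05, seed=1):
    """Fraction of null replicates with at least one (false) set rejection."""
    rng = np.random.default_rng(seed)
    sets = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]  # hypothetical sets
    p = 12
    mu = np.zeros(p)                       # null: no set is differentially expressed
    sigma1 = exchangeable_cov(p, rho)      # covariance of condition 1
    sigma2 = exchangeable_cov(p, rho / 2)  # a different covariance for condition 2
    hits = 0
    for _ in range(n_rep):
        x, y = simulate_two_groups(rng, n, n, mu, sigma1, sigma2)
        if bonferroni_set_test(x, y, sets, alpha).any():
            hits += 1
    return hits / n_rep


if __name__ == "__main__":
    print("empirical FWER under the null:", empirical_fwer())
```

An estimate close to or below the nominal level (0.05 here) suggests that the procedure under study controls the FWER in that setting; replacing `mu` with a non-zero vector for some sets would turn the same loop into a power study.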


Summary

Introduction

Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. [4,6,7,8,9,10,11]), namely, “Is a given set of genes differentially expressed between two conditions?” In this category of methods, the gene set information is used directly when selecting differentially expressed sets of genes between two experimental conditions, and the question it answers has a clear biological meaning. Methods in this category are referred to as self-contained methods and are the focus of this paper.

