Abstract

Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns. We propose and evaluate two methods to determine the proportion of variance of an output data set that can be explained by an input data set when both data panels are high dimensional. Our approach uses random-effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. Using simulations, we show that the MC-ANOVA method gave nearly unbiased estimates. Estimates produced by Eigen-ANOVA were also nearly unbiased, except when the shared variance was very high (e.g., >0.9). We demonstrate the potential insight that can be obtained from the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken (Gallus gallus) genomes and to the assessment of inter-dependencies between gene expression, methylation, and copy-number-variants in data from breast cancer tumors from humans (Homo sapiens). Our analyses reveal that in chicken breeding populations ~50,000 evenly-spaced SNPs are enough to fully capture the span of whole-genome-sequencing genomes. In the study of multi-omic breast cancer data, we found that the span of copy-number-variants can be fully explained using either methylation or gene expression data and that roughly 74% of the variance in gene expression can be predicted from methylation data.

Highlights

  • Modern genomic data often combine information from multiple data-layers, each of which itself can be high-dimensional

  • The methods proposed in this study can be used to answer ubiquitous questions such as: How much of the inter-individual differences in whole-genome sequence genotypes can be predicted using a low-density SNP array? What proportion of variance in gene expression (GE) can be explained by differences in DNA methylation (ME)? How much of the variance in image-derived phenotypes can be predicted from DNA genotypes?

  • We developed two methods that use random-effects models to estimate the proportion of variance of independent vectors in the linear span of an output layer that can be explained by regression on an input layer

Read more

Summary

Introduction

Modern genomic data often combine information from multiple data-layers, each of which itself can be high-dimensional. The methods proposed in this study can be used to answer ubiquitous questions such as: How much of the inter-individual differences in whole-genome sequence genotypes can be predicted using a low-density SNP array? Canonical Correlation Analysis (CCA, [1]), Multivariate-Analysis of Variance (MANOVA, [2]) and Reduced Rank-Regressions, (e.g., Partial Least Squares, PLS, [3]) are three methodologies often used to assess associations in multi-dimensional problems. These approaches have limitations that make some of them inadequate for estimating the proportion of variance explained when both the output and input layers are high-dimensional

Objectives
Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.