Abstract
There is growing interest in studying natural variation in human gene expression. Studies mapping genetic determinants of expression profiles are often carried out considering the expression of one gene at a time, an approach that is computationally intensive and may be prone to a high false-discovery rate because the number of genes under consideration often exceeds tens of thousands. We present an exploratory method for investigating such data and apply it to the data provided as Problem 1 of Genetic Analysis Workshop 15 (GAW15). In multivariate analysis, canonical correlation analysis is a common way to inspect the relationship between two sets of variables based on their correlation. It determines linear combinations of all variables from each data set such that the correlation between the two linear combinations is maximized. However, due to the large number of genes, linear combinations involving all single-nucleotide polymorphism (SNP) loci and gene expression phenotypes lack biological plausibility and interpretability. We introduce sparse canonical correlation analysis, which examines the relationships between many genetic loci and gene expression phenotypes by providing sparse linear combinations that include only a small subset of loci and gene expression phenotypes. These correlated sets of variables are sufficiently small for biological interpretability and further investigation. Applying this method to the GAW15 Problem 1 data, we identified groups of 41 loci and 150 gene expression phenotypes with the highest between-group correlation of 43%.
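For readers who prefer a formula, the canonical correlation objective described above can be written as follows. The notation is an assumption introduced here for illustration and does not come from the original text: X denotes the genotype matrix, Y the expression matrix, u and v the two loading vectors, and the Sigma terms the corresponding sample covariance matrices.

```latex
\[
(\hat{u}, \hat{v})
  = \arg\max_{u,\,v}\;
    \frac{u^{\top} \Sigma_{XY}\, v}
         {\sqrt{u^{\top} \Sigma_{XX}\, u}\;\sqrt{v^{\top} \Sigma_{YY}\, v}}
\]
```

In ordinary canonical correlation analysis every entry of u and v is typically non-zero. A sparse variant seeks loadings for the same kind of objective in which most entries of u and v are exactly zero, so that only a small subset of loci and expression phenotypes enters each linear combination.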
Highlights
Several studies have demonstrated that there is variation in baseline gene expression levels in humans that has a genotypic component [1,2]
A common way to inspect the relationship between two sets of variables based on their correlation is canonical correlation analysis, which determines linear combinations of variables for each data set such that the two linear combinations have maximum correlation
We have developed a new method, sparse canonical correlation analysis (SCCA), which examines the relationships between many genetic loci and gene expression phenotypes
Summary
Several studies have demonstrated that there is variation in baseline gene expression levels in humans that has a genotypic component [1,2]. In this paper we present an exploratory multivariate method for initial investigation of such data and apply it to the data provided as Problem 1 of Genetic Analysis Workshop 15 (GAW15). Due to the large number of genes, linear combinations involving all of the genotypes or gene expression phenotypes lack biological plausibility and interpretability and may not generalize. We have developed a new method, sparse canonical correlation analysis (SCCA), which examines the relationships between many genetic loci and gene expression phenotypes. Only small subsets of the loci and the gene expression phenotypes have non-zero loadings, so the solution provides correlated sets of variables that are sufficiently small for biological interpretation and further investigation. The method can help generate new hypotheses and guide further investigation.
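To make the idea of non-zero loadings on only a small subset of variables concrete, here is a minimal Python sketch of one common way to obtain sparse canonical loadings: alternating updates of the two loading vectors with soft-thresholding. It is an illustration under stated assumptions, not necessarily the exact procedure used in the paper. The function names (`standardize`, `soft_threshold`, `sparse_cca`, `demo`), the penalty values `lam_u` and `lam_v`, and the simulated data are all hypothetical.

```python
import numpy as np


def standardize(A):
    """Center each column and scale it to unit variance.

    Assumes no column is constant (e.g. no monomorphic SNP).
    """
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)


def soft_threshold(a, lam):
    """Shrink entries toward zero; entries with |a_i| <= lam become exactly 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def _unit(a):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca(X, Y, lam_u=0.2, lam_v=0.2, n_iter=200, tol=1e-8):
    """Return sparse loading vectors (u, v) for the columns of X and Y.

    Illustrative choices: the cross-correlation matrix K is used directly
    (within-set covariances are treated as identity), and lam_u / lam_v
    are hypothetical penalties that would need tuning on real data.
    """
    Xs, Ys = standardize(X), standardize(Y)
    K = Xs.T @ Ys / (X.shape[0] - 1)  # p x q sample cross-correlation matrix

    # Start from the leading singular vectors of K.
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        u_new = _unit(soft_threshold(_unit(K @ v), lam_u))
        v_new = _unit(soft_threshold(_unit(K.T @ u_new), lam_v))
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v


def demo():
    """Simulate 30 'loci' and 40 'expression phenotypes' sharing one latent factor."""
    rng = np.random.default_rng(0)
    n, p, q = 200, 30, 40
    z = rng.normal(size=n)        # shared latent signal
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, q))
    X[:, :3] += z[:, None]        # only the first 3 columns of X carry signal
    Y[:, :4] += z[:, None]        # only the first 4 columns of Y carry signal

    u, v = sparse_cca(X, Y)
    corr = np.corrcoef(standardize(X) @ u, standardize(Y) @ v)[0, 1]
    return u, v, corr


if __name__ == "__main__":
    u, v, corr = demo()
    print("non-zero loadings in u:", np.flatnonzero(u))
    print("non-zero loadings in v:", np.flatnonzero(v))
    print("correlation between the two sparse combinations:", round(corr, 3))
```

Larger penalties give smaller selected sets at the cost of a lower between-set correlation. On real data the penalties would typically be chosen by cross-validation rather than fixed in advance.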