Abstract

Our work is motivated by the search for metabolite quantitative trait loci (QTL) in a cohort of more than 5000 people. There are 158 metabolites measured by NMR spectroscopy in the 31-year follow-up of the Northern Finland Birth Cohort 1966 (NFBC66). These metabolites, as with many multivariate phenotypes produced by high-throughput biomarker technology, exhibit strong correlation structures. Existing approaches for combining such data with genetic variants for multivariate QTL analysis generally ignore phenotypic correlations or make restrictive assumptions about the associations between phenotypes and genetic loci. We present a computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional data, with cell-sparse variable selection and sparse graphical structure for covariance selection. Cell sparsity allows different phenotype responses to be associated with different genetic predictors and the graphical structure is used to represent the conditional dependencies between phenotype variables. To achieve feasible computation of the large model space, we exploit a factorisation of the covariance matrix. Applying the model to the NFBC66 data with 9000 directly genotyped single nucleotide polymorphisms, we are able to simultaneously estimate genotype–phenotype associations and the residual dependence structure among the metabolites. The R package BayesSUR with full documentation is available at https://cran.r-project.org/web/packages/BayesSUR/

Highlights

  • Integrating high-dimensional molecular biomarker data sets is a fundamental problem in genetic epidemiology and bioinformatics, in the search for molecular mechanisms mediating the effects of genetic variants on clinical phenotypes

  • Our case study is in metabolomic quantitative trait locus (QTL) analysis, a powerful approach for identifying genes associated with metabolic markers of disease, where the multivariate response is generally on the order of hundreds of metabolites

  • We provide the version using the sparse covariance structure

Summary

Introduction

Integrating high-dimensional molecular biomarker data sets is a fundamental problem in genetic epidemiology and bioinformatics, in the search for molecular mechanisms mediating the effects of genetic variants on clinical phenotypes. Commonly, univariate regressions are performed for each phenotype–genotype pair, which requires post hoc adjustment for multiple comparisons and ignores any correlations between genotypes and between phenotypes. This is unlikely to be the best strategy for data where latent structures induce high levels of correlation between phenotypes, for example serum metabolomic profiles (Kettunen et al., 2012; Soininen et al., 2009), imaging and gene expression data. The metabolite set comprises lipoprotein particle concentrations, low molecular weight metabolites such as amino acids, 3-hydroxybutyrate and creatinine, and different serum lipids, including free and esterified cholesterol, sphingomyelin and fatty acid saturation. These data exhibit strong residual correlation (Kettunen et al., 2012; Marttinen et al., 2014), even after accounting for the variance explained by all reported SNPs.
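To make the baseline strategy concrete, the following is a minimal sketch of a pairwise univariate scan with a Bonferroni adjustment, run on simulated data. It illustrates the approach criticised above, not the Bayesian seemingly unrelated regressions model of this paper. All names (`marginal_pvalue`, `univariate_scan`), the simulated sample sizes and effect size, and the use of a normal approximation for the p-value are assumptions made for illustration only.

```python
import math
import random


def marginal_pvalue(x, y):
    """Two-sided p-value for the slope in the simple regression y ~ x.

    Uses a normal approximation to the t statistic, which is an
    assumption that is only reasonable for large samples.
    Assumes x is not constant (otherwise sxx would be zero).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    rss = sum((b - my - beta * (a - mx)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2))


def univariate_scan(X, Y, alpha=0.05):
    """Test every SNP-phenotype pair separately, Bonferroni-adjusted.

    X is a list of SNP columns, Y a list of phenotype columns.
    Each pair is tested in isolation, so correlations between
    phenotypes (and between SNPs) are ignored.
    """
    threshold = alpha / (len(X) * len(Y))
    hits = []
    for j, x in enumerate(X):
        for k, y in enumerate(Y):
            p = marginal_pvalue(x, y)
            if p < threshold:
                hits.append((j, k, p))
    return threshold, hits


# Hypothetical simulated data: 500 individuals, 20 SNPs, 5 phenotypes.
rng = random.Random(1)
n, n_snps, n_traits = 500, 20, 5
X = [[rng.choice([0, 1, 2]) for _ in range(n)] for _ in range(n_snps)]

# A shared latent factor induces correlation between all phenotypes.
latent = [rng.gauss(0, 1) for _ in range(n)]
Y = [[latent[i] + rng.gauss(0, 1) for i in range(n)] for _ in range(n_traits)]

# One true association: SNP 0 affects phenotype 0.
Y[0] = [y + 1.0 * X[0][i] for i, y in enumerate(Y[0])]

threshold, hits = univariate_scan(X, Y)
```

The scan runs one test per SNP–phenotype pair, so the Bonferroni threshold shrinks as the number of phenotypes grows, and the shared latent factor that correlates the phenotypes is never used.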

