Abstract
Genome-Wide association studies (GWAS), based on testing one single nucleotide polymorphism (SNP) at a time, have revolutionized our understanding of the genetics of complex traits. In GWAS, there is a need to consider confounding effects such as due to population structure, and take groups of SNPs into account simultaneously due to the “polygenic” attribute of complex quantitative traits. In this paper, we propose a new approach SGL-LMM that puts together sparse group lasso (SGL) and linear mixed model (LMM) for multivariate associations of quantitative traits. LMM, as has been often used in GWAS, controls for confounders, while SGL maintains sparsity of the underlying multivariate regression model. SGL-LMM first sets a fixed zero effect to learn the parameters of random effects using LMM, and then estimates fixed effects using SGL regularization. We present efficient algorithms for hyperparameter tuning and feature selection using stability selection. While controlling for confounders and constraining for sparse solutions, SGL-LMM also provides a natural framework for incorporating prior biological information into the group structure underlying the model. Results based on both simulated and real data show SGL-LMM outperforms previous approaches in terms of power to detect associations and accuracy of quantitative trait prediction.
Highlights
Quantitative traits are important in medicine, agriculture, and evolution but, until recently, few polymorphisms have been shown to be related in these traits
Experiments on semi-empirical data showed that the combination of sparse group lasso and a linear mixed model yielded better power to identify marker associations in a large range of settings, and application to real datasets have verified that SGL-LMM generated a sparse solution with accurate prediction of phenotypes and interpretable detection of marker associations
We modeled the phenotype as a sum of three terms: a fixed effect determined by the association single nucleotide polymorphism (SNP), a random confounding effect due to population structure, and an i.i.d. noise as follows: y = Xβ + ypop + φ where y is a vector of observed phenotypes of size m × 1 for m samples, X is a m × q matrix that consists of SNPs and other variables of the m samples, ypop is a m × 1 random matrix with distribution N (0, σg2K) where K
Summary
Quantitative traits are important in medicine, agriculture, and evolution but, until recently, few polymorphisms have been shown to be related in these traits. The SGL has a L2 penalty that promotes the selection of only a subset of the groups and L1 penalty that promotes the selection of only a subset of the predictors within a group Another important factor in genetic association studies is the existence of confounding, which are indirect associations between markers and traits due to factors like population structure, family structure, and cryptic relatedness. Experiments on semi-empirical data showed that the combination of sparse group lasso and a linear mixed model yielded better power to identify marker associations in a large range of settings, and application to real datasets have verified that SGL-LMM generated a sparse solution with accurate prediction of phenotypes and interpretable detection of marker associations
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have