Abstract

Genome-wide association studies (GWAS), which test one single nucleotide polymorphism (SNP) at a time, have revolutionized our understanding of the genetics of complex traits. GWAS must account for confounding effects, such as those due to population structure, and must consider groups of SNPs simultaneously because complex quantitative traits are polygenic. In this paper, we propose a new approach, SGL-LMM, that combines the sparse group lasso (SGL) and the linear mixed model (LMM) for multivariate association analysis of quantitative traits. The LMM, which is widely used in GWAS, controls for confounders, while SGL maintains sparsity of the underlying multivariate regression model. SGL-LMM first sets the fixed effects to zero to learn the parameters of the random effects using the LMM, and then estimates the fixed effects using SGL regularization. We present efficient algorithms for hyperparameter tuning and for feature selection using stability selection. While controlling for confounders and constraining for sparse solutions, SGL-LMM also provides a natural framework for incorporating prior biological information into the group structure underlying the model. Results on both simulated and real data show that SGL-LMM outperforms previous approaches in terms of power to detect associations and accuracy of quantitative trait prediction.
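The two-step procedure described in the abstract can be illustrated with a minimal sketch. It assumes details the abstract does not state: that the random-effect parameters are fitted by a grid search over the variance ratio `delta` (noise variance divided by genetic variance) under the zero-fixed-effect model, and that the data are then rotated and rescaled ("whitened") before the sparse regression step. The function names `fit_null_lmm` and `whiten`, the toy kinship matrix, and the simulated data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_null_lmm(y, K, deltas=None):
    """Step 1 (assumed form): with fixed effects set to zero, choose
    delta = sigma_e^2 / sigma_g^2 by maximizing the Gaussian log-likelihood
    of y ~ N(0, sigma_g^2 * (K + delta * I)) over a grid of candidates."""
    if deltas is None:
        deltas = np.logspace(-3, 3, 61)
    m = y.shape[0]
    S, U = np.linalg.eigh(K)        # K = U diag(S) U^T
    S = np.maximum(S, 0.0)          # clamp tiny negative eigenvalues
    y_rot = U.T @ y                 # rotation makes the covariance diagonal
    best = None
    for delta in deltas:
        d = S + delta               # eigenvalues of K + delta * I
        sigma_g2 = np.mean(y_rot ** 2 / d)   # closed-form ML estimate given delta
        ll = -0.5 * (m * np.log(2 * np.pi * sigma_g2) + np.sum(np.log(d)) + m)
        if best is None or ll > best[0]:
            best = (ll, delta, sigma_g2)
    _, delta, sigma_g2 = best
    return delta, sigma_g2, S, U

def whiten(X, y, delta, S, U):
    """Step 2 preprocessing (assumed): rotate and rescale so that the
    covariance K + delta * I is mapped to the identity."""
    w = 1.0 / np.sqrt(S + delta)
    return (U.T @ X) * w[:, None], (U.T @ y) * w

# Toy data: the kinship matrix K = X X^T / q is an illustrative assumption.
rng = np.random.default_rng(0)
m, q = 60, 20
X = rng.standard_normal((m, q))
K = X @ X.T / q
y = rng.multivariate_normal(np.zeros(m), 1.0 * K + 0.5 * np.eye(m))

delta, sigma_g2, S, U = fit_null_lmm(y, K)
Xw, yw = whiten(X, y, delta, S, U)   # inputs for a sparse regression solver
```

Under these assumptions, `Xw` and `yw` would be passed to any sparse group lasso solver in place of the raw `X` and `y`, so that the regularized fit sees approximately uncorrelated residuals.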

Highlights

  • Quantitative traits are important in medicine, agriculture, and evolution but, until recently, few polymorphisms have been shown to be related to these traits

  • Experiments on semi-empirical data showed that combining the sparse group lasso with a linear mixed model yielded better power to identify marker associations across a wide range of settings, and application to real datasets has verified that SGL-LMM generates a sparse solution with accurate prediction of phenotypes and interpretable detection of marker associations

  • We modeled the phenotype as a sum of three terms: a fixed effect determined by the associated single nucleotide polymorphisms (SNPs), a random confounding effect due to population structure, and i.i.d. noise: $y = X\beta + y_{pop} + \varphi$, where $y$ is an $m \times 1$ vector of observed phenotypes for $m$ samples, $X$ is an $m \times q$ matrix of SNPs and other variables for the $m$ samples, and $y_{pop}$ is an $m \times 1$ random vector with distribution $N(0, \sigma_g^2 K)$ where K


Summary

INTRODUCTION

Quantitative traits are important in medicine, agriculture, and evolution but, until recently, few polymorphisms have been shown to be related to these traits. The SGL has an L2 penalty that promotes the selection of only a subset of the groups and an L1 penalty that promotes the selection of only a subset of the predictors within a group. Another important factor in genetic association studies is confounding: indirect associations between markers and traits due to factors such as population structure, family structure, and cryptic relatedness. Experiments on semi-empirical data showed that combining the sparse group lasso with a linear mixed model yielded better power to identify marker associations across a wide range of settings, and application to real datasets has verified that SGL-LMM generates a sparse solution with accurate prediction of phenotypes and interpretable detection of marker associations.
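To make the two roles of the SGL penalty concrete, the following minimal sketch evaluates a sparse group lasso penalty and its proximal operator. The parameterization is an assumption, not taken from this summary: a mixing weight `alpha` between the L1 and group-wise L2 terms, an overall strength `lam`, and a square-root group-size weight on each group norm, following a commonly used formulation. The names `sgl_penalty` and `sgl_prox` are hypothetical.

```python
import numpy as np

def sgl_penalty(beta, groups, lam, alpha):
    """Assumed SGL penalty:
    lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2)."""
    l1 = np.sum(np.abs(beta))
    l2 = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
    return lam * (alpha * l1 + (1 - alpha) * l2)

def sgl_prox(v, groups, lam, alpha, step=1.0):
    """Proximal operator of step * sgl_penalty for non-overlapping groups:
    elementwise soft-thresholding (L1 part, sparsity within a group)
    followed by group-wise shrinkage (L2 part, sparsity across groups)."""
    out = np.zeros_like(v, dtype=float)
    for g in groups:
        # L1 step: shrink each coefficient toward zero
        u = np.sign(v[g]) * np.maximum(np.abs(v[g]) - step * lam * alpha, 0.0)
        # L2 step: shrink the whole group, or drop it if its norm is small
        norm = np.linalg.norm(u)
        thresh = step * lam * (1 - alpha) * np.sqrt(len(g))
        if norm > thresh:
            out[g] = (1 - thresh / norm) * u
    return out

# Toy example: two groups of predictors (indices are illustrative).
groups = [[0, 1], [2, 3, 4]]
v = np.array([3.0, -0.5, 0.2, 0.1, -0.3])
shrunk = sgl_prox(v, groups, lam=1.0, alpha=0.5)
```

In this toy example the second group is removed entirely by the group-wise step, while the first group survives with one of its two coefficients set to zero by the elementwise step, which is the "subset of groups, subset of predictors within a group" behavior described above.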

Method
Sparse Group Lasso
Phenotype Prediction
Model Selection
Application With Arabidopsis thaliana Data
Existing Methods
Performance Measurements
Alternative Methods
Application With Arabidopsis thaliana
DISCUSSION
