Abstract
Understanding the role of genetic variation in human diseases remains an important problem in genomics. An important component of such variation consists of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs with phenotypes has been confounded by hidden factors such as population structure, family structure or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for these factors, such as linear mixed-effect models (LMMs) and methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. The runtime of our method scales quadratically in the number of individuals being studied, with only a modest loss in statistical power compared to LMM-based and PCA-based methods when tested on synthetic data generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime than LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science.
Highlights
Population structure, family structure and/or cryptic relatedness are well-known confounding factors that cause spurious associations to be found in genome-wide association studies (GWAS) [1,2,3,4,5,6]
We present a novel GWAS method that accounts for confounding factors such as population structure, family structure or cryptic relatedness
Similar to association methods based on linear mixed-effect models (LMMs) and principal components analysis (PCA), our model accounts for confounding factors through the use of pairwise similarities between patients, which allows us to significantly reduce false positive rates when performing associations
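The pairwise similarities mentioned in the highlights can be illustrated with a small sketch. The text above does not specify which similarity measure the model uses, so the code below assumes one common choice: the average product of standardized genotypes across SNPs (a realized relationship matrix). The function names `standardize_snp` and `similarity_matrix` and the 0/1/2 allele-count encoding are assumptions made for illustration, not part of the authors' software.

```python
from statistics import mean, pstdev


def standardize_snp(column):
    """Center and scale one SNP column of 0/1/2 allele counts.

    A SNP with no variation in the sample carries no similarity
    information, so it is mapped to all zeros.
    """
    mu = mean(column)
    sd = pstdev(column)
    if sd == 0:
        return [0.0] * len(column)
    return [(g - mu) / sd for g in column]


def similarity_matrix(genotypes):
    """Return an n x n matrix of pairwise genetic similarities.

    genotypes: list of n individuals, each a list of m allele counts.
    Assumed measure (not taken from the paper): entry (i, k) is the
    mean over SNPs of the product of standardized genotypes.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    columns = [standardize_snp([row[j] for row in genotypes]) for j in range(m)]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            s = sum(col[i] * col[k] for col in columns) / m
            K[i][k] = K[k][i] = s  # the matrix is symmetric
    return K


if __name__ == "__main__":
    # Toy data: individuals 0 and 1 are identical, individual 2 differs.
    toy = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
    for row in similarity_matrix(toy):
        print([round(v, 3) for v in row])
```

Building the full matrix costs on the order of n² × m operations for n individuals and m SNPs, which is one reason similarity-based corrections become expensive for large samples.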
Summary
Population structure, family structure and/or cryptic relatedness are well-known confounding factors that cause spurious associations to be found in GWAS [1,2,3,4,5,6]. Several methods have been proposed that use a principal components analysis of individuals’ SNPs [4], perform a post-hoc correction of test statistics such as Genomic Control [2], or cluster individuals before performing an aggregate association between clusters and phenotypes [11]. These methods, while accounting for confounding factors under different assumptions, have been shown either to suffer from insufficient statistical power when the confounding effects are strong [4,5] or to be unable to fully capture these effects, so that many false positives are produced [3,5,12]. In several recent studies [3,5,12,13], methods based on LMMs were found to produce fewer false positives and to have higher statistical power than other methods for modeling confounding factors, making LMMs a popular class of GWAS methods.
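As an illustration of the post-hoc correction mentioned above, Genomic Control [2] rescales association test statistics by an inflation factor estimated from their median. The sketch below assumes 1-degree-of-freedom chi-square statistics and uses the standard approximation 0.4549 for the median of that distribution. The function name `genomic_control` and the choice to floor the factor at 1 are illustrative assumptions, not details taken from the text.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for 1-df chi-square tests.

    lambda is the observed median divided by the expected median under
    the null. It is floored at 1 here (an assumed convention) so that
    statistics are never inflated by the correction.
    """
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)
    return lam, [s / lam for s in chi2_stats]


if __name__ == "__main__":
    # Hypothetical test statistics, for illustration only.
    lam, adjusted = genomic_control([0.2, 0.9098, 3.0])
    print(lam, adjusted)
```

Because every statistic is divided by the same factor, this correction cannot reorder SNPs. It only changes how many of them pass a significance threshold.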