Abstract
Genome-wide Association Studies (GWAS) result in millions of summary statistics (“z-scores”) for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to recent GWAS of schizophrenia (N = 82,315) and putamen volume (N = 12,596), with approximately 9.3 million SNP z-scores in both cases. We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We assess the degree to which effect sizes are over-estimated when based on linear-regression association coefficients. 
We estimate the polygenicity of schizophrenia to be 0.037 and that of putamen volume to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10⁶ and 10⁵. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.
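The mixture-of-Gaussians idea and the over-estimation of effect sizes can be illustrated with a minimal sketch. It is not the four-parameter model of the paper, whose exact parameterization is not given in this text. It assumes each z-score is a true effect plus zero-mean Gaussian noise, and that true effects come from a mixture of zero-mean Gaussians. The function names, the weights, and the variances are all hypothetical, chosen only for illustration.

```python
import math


def normal_pdf(x, var):
    """Density at x of a zero-mean Gaussian with variance `var`."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(z, weights, effect_vars, noise_var=1.0):
    """Marginal density of an observed z-score.

    Assumption: z = delta + noise, with noise ~ N(0, noise_var) and
    delta ~ N(0, effect_vars[k]) with probability weights[k].
    Each component of z is then N(0, noise_var + effect_vars[k]).
    """
    return sum(w * normal_pdf(z, noise_var + t)
               for w, t in zip(weights, effect_vars))


def posterior_mean_effect(z, weights, effect_vars, noise_var=1.0):
    """Expected true effect E[delta | z] under the assumed mixture.

    Within component k the posterior mean is z shrunk by
    effect_vars[k] / (noise_var + effect_vars[k]); components are then
    averaged using their posterior probabilities given z.
    """
    comps = [w * normal_pdf(z, noise_var + t)
             for w, t in zip(weights, effect_vars)]
    total = sum(comps)
    return sum((c / total) * (t / (noise_var + t)) * z
               for c, t in zip(comps, effect_vars))


# Hypothetical parameters: 95% of SNPs with no effect, 5% with effect variance 4.
WEIGHTS = [0.95, 0.05]
EFFECT_VARS = [0.0, 4.0]

if __name__ == "__main__":
    for z in (1.0, 3.0, 6.0):
        print(z, posterior_mean_effect(z, WEIGHTS, EFFECT_VARS))
```

Under these assumptions the expected true effect is always smaller in magnitude than the observed z-score, which is one way to see why raw association coefficients tend to over-estimate effect sizes.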
Highlights
Many complex traits and common phenotypes have a genetic component that arises from large numbers of genetic loci (Visscher et al, 2012)
We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal single nucleotide polymorphisms (SNPs)
The model has great utility in that it allows for the prediction of the replication probability and expected effect size for each SNP, given the discovery and replication sample sizes, the four model parameters, and the discovery sample z-score of the SNP
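As a rough illustration of how a replication probability could be computed from a discovery z-score and the two sample sizes, the sketch below uses an assumed zero-mean Gaussian mixture for true effects. It is not the paper's model. It further assumes that the expected z-score scales with the square root of sample size, and that "replication" means the replication z-score has the same sign as the discovery z-score and exceeds a chosen threshold. The function names, the threshold, and all parameter values are hypothetical.

```python
import math


def normal_pdf(x, var):
    """Density at x of a zero-mean Gaussian with variance `var`."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def replication_probability(z_disc, n_disc, n_rep, weights, effect_vars,
                            noise_var=1.0, z_threshold=1.96):
    """P(replication z-score has the discovery sign and magnitude > threshold).

    Assumptions (illustrative only):
      * z_disc = delta + noise, noise ~ N(0, noise_var);
      * delta ~ N(0, effect_vars[k]) with probability weights[k], where the
        variances are expressed in discovery z-score units;
      * the replication z-score is N(scale * delta, noise_var) with
        scale = sqrt(n_rep / n_disc).
    """
    scale = math.sqrt(n_rep / n_disc)
    z = abs(z_disc)  # work with the magnitude; the model is symmetric in sign
    comps = [w * normal_pdf(z, noise_var + t)
             for w, t in zip(weights, effect_vars)]
    total = sum(comps)
    prob = 0.0
    for comp, t in zip(comps, effect_vars):
        shrink = t / (noise_var + t)
        post_mean = shrink * z           # E[delta | z, component]
        post_var = shrink * noise_var    # Var[delta | z, component]
        pred_mean = scale * post_mean
        pred_sd = math.sqrt(scale * scale * post_var + noise_var)
        tail = 1.0 - normal_cdf((z_threshold - pred_mean) / pred_sd)
        prob += (comp / total) * tail
    return prob


# Hypothetical parameters: 95% of SNPs with no effect, 5% with effect variance 4.
WEIGHTS = [0.95, 0.05]
EFFECT_VARS = [0.0, 4.0]
```

For example, `replication_probability(6.0, 1000, 4000, WEIGHTS, EFFECT_VARS)` evaluates the sketch for a strong discovery signal and a fourfold larger replication sample. Under these assumptions the probability rises with both the discovery z-score magnitude and the replication sample size.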
Summary
Many complex traits and common phenotypes have a genetic component that arises from large numbers of genetic loci (Visscher et al, 2012). The total effect of the genetic component on phenotypic expression is often substantial, as indicated by measures of heritability (Tenesa and Haley, 2013; Witte et al, 2014) obtained from twin and family studies and genome-wide association studies (GWAS) for multiple phenotypes. GWAS provide a platform for uncovering the underlying genetic architecture, but this poses a substantial challenge, compounded by the complexity of the datasets: ∼10⁴–10⁵ individuals with ∼10⁷ genetic markers (single nucleotide polymorphisms, or SNPs) in various levels of correlation (linkage disequilibrium, or LD), ∼10⁶ of which are estimated to be independent (Dudbridge and Gusnanto, 2008; Pe’er et al, 2008), with multiple possible roles for SNPs in mechanistic pathways. Because the number of markers is much larger than the number of individuals in GWAS, modeling assumptions are required to estimate parameters of interest and thereby obtain realistic descriptions of the numbers, distributions, and effect sizes of causal SNPs (and of the considerably larger number of SNPs in strong LD with causal SNPs), which in turn can assist in causal SNP discovery and individual risk prediction, and inform mechanistic understanding of genetic effects in phenotypic expression.