Abstract
Motivation: Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some ‘truth’ is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study.
Results: We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods, as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis.
Availability and implementation: Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS.
Supplementary information: Supplementary data are available at Bioinformatics online.
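The idea of simulating random variates about expected values can be illustrated with a minimal sketch. It is not the simGWAS implementation. It assumes that the expected Z scores have already been derived, and that the random variates are multivariate normal with covariance equal to the SNP correlation (LD) matrix. The function name `simulate_z_scores`, the three-SNP region and all numeric values are hypothetical.

```python
import numpy as np


def simulate_z_scores(expected_z, ld, n_rep, seed=0):
    """Draw n_rep vectors of Z scores centred on expected_z.

    Assumption (not taken from the source): variates are multivariate
    normal with covariance equal to the LD correlation matrix.
    """
    expected_z = np.asarray(expected_z, dtype=float)
    ld = np.asarray(ld, dtype=float)
    rng = np.random.default_rng(seed)
    # One row per simulated study, one column per SNP.
    return rng.multivariate_normal(expected_z, ld, size=n_rep)


# Hypothetical three-SNP region; the expected Z scores are made up.
expected_z = [4.0, 3.2, 0.5]
ld = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
z = simulate_z_scores(expected_z, ld, n_rep=2000)
print(z.shape)         # one simulated summary dataset per row
print(z.mean(axis=0))  # should lie close to expected_z
```

In this sketch the cost of each draw depends on the number of SNPs and not on the number of individuals, which is the property the abstract describes: sample size would enter only through the expected values.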
Summary
The genome wide association study design is more than a decade old (Visscher et al., 2017), and the size of GWAS cohorts has continued to grow, from 1000s to 1 000 000s of individuals. Given the competing demands of open science and privacy concerns (P3G Consortium et al., 2009), it has become standard to share data in the form of summary statistics (allelic effect sizes and standard errors, or P values) more readily than the full genotype data. Summary data methods are often derived through approximating a multivariate linear regression likelihood by incorporating information about correlation structures (linkage disequilibrium, LD) from reference populations. Summary statistic methods that were originally derived for linear regression cannot do this, and the impact of the linearity assumption on their conclusions when applied to case–control data has not been investigated in depth.
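The forms of summary statistic mentioned above are related in a simple way. As a small sketch, assuming a Wald statistic with a standard normal reference distribution, an effect size and its standard error give a Z score and a two-sided P value. The helper name `z_and_p` and the numbers in the usage line are hypothetical.

```python
import math


def z_and_p(beta, se):
    """Return the Wald Z score and two-sided P value for one SNP.

    Assumes Z = beta / se follows a standard normal distribution
    under the null hypothesis.
    """
    z = beta / se
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical log odds ratio and standard error for a single SNP.
z_score, p_value = z_and_p(beta=0.12, se=0.03)
print(z_score, p_value)
```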