SimGWAS: a fast method for simulation of large scale case-control GWAS summary statistics.

Mary D Fortune, Chris Wallace

doi:10.1093/bioinformatics/bty898

Abstract

Motivation: Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some ‘truth’ is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study.

Results: We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods, as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis.

Availability and implementation: Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS.

Supplementary information: Supplementary data are available at Bioinformatics online.
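The idea in the abstract of simulating summary output as random variates about expected values, rather than genotyping every individual, can be illustrated with a minimal sketch. This is not the simGWAS API and not the authors' derivation: it assumes the expected Z scores are already known, and it assumes that Z scores are approximately multivariate normal with covariance equal to the LD correlation matrix, which the abstract does not state. The function names, the three-SNP LD matrix and the expected values are all hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def simulate_z_scores(expected_z, ld, n_sims, rng):
    """Draw n_sims Z-score vectors from N(expected_z, ld) (assumed model)."""
    lower = cholesky(ld)
    n = len(expected_z)
    draws = []
    for _ in range(n_sims):
        # Independent standard normal noise, then correlate it with L.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draws.append([
            expected_z[i] + sum(lower[i][k] * noise[k] for k in range(i + 1))
            for i in range(n)
        ])
    return draws


# Hypothetical inputs: expected Z scores at three SNPs and their LD correlations.
expected_z = [4.0, 3.2, 0.4]
ld = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

rng = random.Random(1)  # fixed seed so the sketch is reproducible
sims = simulate_z_scores(expected_z, ld, 2000, rng)
mean_z = [sum(s[i] for s in sims) / len(sims) for i in range(len(expected_z))]
```

Note that no loop runs over individuals: the cost of each simulated study depends on the number of SNPs, not on the number of cases and controls, which is the property the abstract highlights.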

Highlights

The genome-wide association study design is more than a decade old (Visscher et al., 2017), and the size of GWAS cohorts has continued to grow, from thousands to millions of individuals
Summary data methods are often derived through approximating a multivariate linear regression likelihood by incorporating information about correlation structures from reference populations
Summary statistic methods that were originally derived for linear regression cannot do this, and the impact of the linearity assumption on their conclusions when applied to case–control data has not been investigated in depth

Summary

Introduction

The genome-wide association study design is more than a decade old (Visscher et al., 2017), and the size of GWAS cohorts has continued to grow, from thousands to millions of individuals. Given the competing demands of open science and privacy concerns (P3G Consortium et al., 2009), it has become standard to share data in the form of summary statistics (allelic effect sizes and standard errors, or P values) more readily than the full genotype data. Summary data methods are often derived through approximating a multivariate linear regression likelihood by incorporating information about correlation structures (linkage disequilibrium, LD) from reference populations. Summary statistic methods that were originally derived for linear regression cannot do this, and the impact of the linearity assumption on their conclusions when applied to case–control data has not been investigated in depth.
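The summary statistics named above are related by a standard conversion: dividing an effect size by its standard error gives a Z score, and under a normal approximation the two-sided P value follows from it. The sketch below shows that conversion; the function names and the example effect size and standard error are hypothetical and are not taken from the paper.

```python
import math


def z_score(beta, se):
    """Z score (Wald statistic) from an effect size and its standard error."""
    return beta / se


def two_sided_p(z):
    """Two-sided P value for a Z score under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical per-allele log odds ratio and standard error for one SNP.
beta, se = 0.12, 0.03
z = z_score(beta, se)
p = two_sided_p(z)
```

Because the conversion needs only the effect size and its standard error, a shared table of either pair (effect and standard error, or P values with effect direction) carries much of the same per-SNP information.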

Journal: Bioinformatics	Publication Date: Oct 29, 2018
Citations: 30	License type: CC BY 4.0

Similar Papers

GWASBrewer: An R Package for Simulating Realistic GWAS Summary Statistics.
Jean Morrison
Genetic epidemiology | VOL. -
06 Oct 2024

SU81 - COMPREHENSIVE EVALUATION OF ENRICHMENT FOR CIRCADIAN CLOCK GENE SETS IN PSYCHIATRIC TRAITS: SPECIFIC ENRICHMENT IN CLINICAL RESPONSE TO LITHIUM
Sergi Papiol ... Thomas Schulze
European Neuropsychopharmacology | VOL. 29
01 Jan 2019

Genome-wide multi-trait analysis on cardioembolic stroke identifies 47 novel loci
L Meseguer Monfort ... L Andreasen
European Heart Journal | VOL. 43
03 Oct 2022

HAPRAP: a haplotype-based iterative method for statistical fine mapping using GWAS summary statistics.
Jie Zheng ... Richard Morris
Bioinformatics (Oxford, England) | VOL. 33
01 Sep 2016
