Abstract

Background: Simulation of genetic variants data is frequently required for the evaluation of statistical methods in the fields of human and animal genetics. Although a number of high-quality genetic simulators have been developed, many of them require advanced knowledge in population genetics or in computation to be used effectively. In addition, generating simulated data in the context of family-based studies demands sophisticated methods and advanced computer programming.

Results: To address these issues, we propose a new user-friendly and integrated R package, sim1000G, which simulates variants in genomic regions among unrelated individuals or among families. The only input needed is a raw phased Variant Call Format (VCF) file. Haplotypes are extracted to compute linkage disequilibrium (LD) in the simulated genomic regions and for the generation of new genotype data among unrelated individuals. The covariance across variants is used to preserve the LD structure of the original population. Pedigrees of arbitrary sizes are generated by modeling recombination events with sim1000G. To illustrate the application of sim1000G, various scenarios are presented assuming unrelated individuals from a single population or two distinct populations, or alternatively for three-generation pedigree data. Sim1000G can capture allele frequency diversity, short and long-range linkage disequilibrium (LD) patterns and subtle population differences in LD structure without the need of any tuning parameters.

Conclusion: Sim1000G fills a gap in the vast area of genetic variants simulators by its simplicity and independence from external tools. Currently, it is one of the few simulation packages completely integrated into R and able to simulate multiple genetic variants among unrelated individuals and within families. Its implementation will facilitate the application and development of computational methods for association studies with both rare and common variants.
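As an illustration of the LD computation the abstract refers to, the following minimal Python sketch computes the squared correlation r² between two biallelic variants from phased haplotypes coded as 0/1. It is not sim1000G's implementation (which is in R); the function name `ld_r2` and the toy haplotype matrix are assumptions made for this example only.

```python
def ld_r2(haplotypes, i, j):
    """Squared correlation (r^2) between variants i and j.

    `haplotypes` is a list of phased haplotypes, each a list of 0/1 alleles.
    Hypothetical helper for illustration; not part of sim1000G.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at variant i
    p_j = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at variant j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ij - p_i * p_j                             # LD coefficient D (a covariance)
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0                                   # a monomorphic variant carries no LD
    return d * d / denom


# Toy data (made up): variants 0 and 1 always co-occur, variant 2 is independent of variant 0.
haplotypes = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
]

r2_linked = ld_r2(haplotypes, 0, 1)
r2_unlinked = ld_r2(haplotypes, 0, 2)
```

The coefficient D is the covariance between the two allele indicators, which is why preserving the covariance across variants preserves the pairwise LD structure.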

Highlights

  • Simulation of genetic variants data is frequently required for the evaluation of statistical methods in the fields of human and animal genetics

  • Genetic simulators might be useful for creating some generalizable benchmark data sets and/or reference simulation program(s) for the user community [3]

  • We used the default parameters specified by each software package (Additional file 1) and simulated 2000 genetic variants spanning positions 1 to 10 Mbp on chromosome 1


Summary

Results

Comparison with other genetic variants simulators: We compared sim1000G with two competing, well-established genetic variants simulators: hapgen2 [6] and simuGWAS [14]. The parameters for generating the data with sim1000G included: (a) the length n of the simulated region, where n is the number of variants showing genetic variation; (b) the range of MAFs for the simulated variants. In terms of computation time, sim1000G was more efficient and easier to implement than simuPOP. The former only simulates genotype data for a pre-specified sample size, while the latter requires, for each simulation scenario, the generation of a huge initial population from which the final set of individuals is extracted.

A few methods have been developed for testing sets of genetic variants in family studies, but few of them were proposed in the context of right-censored time-to-event data [21]. A correlated frailty model can be used to test the association between a set of genetic variants and a survival outcome in family studies [22]. For datasets with 500 families, the test detects the association in all the replicates.
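The two simulation parameters listed above, the region length n and the MAF range, can be illustrated with a small Python sketch that selects up to n variants whose minor allele frequency falls inside a given range. This is an assumption-laden illustration, not the package's R code; the helper names `minor_allele_freq` and `select_variants` and the toy haplotypes are hypothetical.

```python
def minor_allele_freq(haplotypes, i):
    """MAF of variant i across 0/1-coded haplotypes (hypothetical helper)."""
    p = sum(h[i] for h in haplotypes) / len(haplotypes)
    return min(p, 1 - p)


def select_variants(haplotypes, n, maf_min, maf_max):
    """Indices of the first n variants with maf_min <= MAF <= maf_max."""
    n_variants = len(haplotypes[0])
    keep = [
        i for i in range(n_variants)
        if maf_min <= minor_allele_freq(haplotypes, i) <= maf_max
    ]
    return keep[:n]  # truncate to the requested region length


# Toy data (made up): 4 haplotypes x 4 variants.
haplotypes = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

selected = select_variants(haplotypes, n=2, maf_min=0.2, maf_max=0.3)
```

Here variants 0 and 3 have a MAF of 0.25 and are kept, variant 1 (MAF 0.5) falls outside the range, and variant 2 is monomorphic.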

Availability of data and materials: Not applicable