Abstract
Although genome sequencing has become increasingly popular, the simulation of individual genomes is still important. Sequencing a large number of individual genomes is costly, and genome data with extreme and boundary conditions, such as fatal genetic defects, are difficult to obtain. Privacy and legal barriers also prevent many applications of real data. Large sequencing projects in recent years have provided a deeper understanding of the human genome. However, there is a lack of tools that leverage known data to simulate personal genomes as realistically as possible. Here, we designed and developed PGsim, a comprehensive and highly customizable individual genome simulator that makes full use of existing knowledge, such as variant allele frequencies in the global population or in major world populations, mutation probability differences between protein-coding and non-coding regions, transition/transversion (Ti/Tv) ratios, Indel incidence, Indel length distribution, structural variation sites, and pathogenic mutation sites. Through detailed parameter settings, users can flexibly control the proportion and quantity of known variants, common variants, novel variants in both coding and non-coding regions, and special variants. While ensuring that the simulated personal genome is sufficiently random, PGsim keeps the generated variants realistic and reliable in terms of variant distribution, proportion, and population characteristics. PGsim can use very large databases as background data to simulate personal genomes and does not require SQL database support. Users can easily change the variant databases used as needed. Because PGsim is a Perl script, it runs on any version of macOS or Linux, and no libraries, packages, interpreters, compilers, or other dependencies need to be installed in advance. The PGsim tool is publicly available at https://github.com/lrjuan/PGsim.
Highlights
Although personal genome sequencing has become increasingly popular, the simulation of individual genomes remains important
We developed PGsim, a comprehensive and highly customizable personal genome simulator that integrates existing knowledge about genomic variants
According to user-specified parameters, PGsim considers the type, source, location, allele frequency (AF), Ti/Tv ratio, and other properties of genomic variants, and either randomly extracts variants from known variant databases or generates them randomly
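As an illustration of the sampling described in the highlights above, the following Python sketch shows one way allele frequencies and a Ti/Tv ratio could drive variant selection. It is not PGsim's code (PGsim is a Perl script). The record layout, the function names `sample_known_variants` and `novel_snv_alt`, and the assumption that the two haplotypes are drawn independently at each site are hypothetical choices made for this example.

```python
import random

# Transition partners: purine <-> purine (A/G) and pyrimidine <-> pyrimidine (C/T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def sample_known_variants(known, rng):
    """Draw a diploid genotype for each known variant from its allele frequency.

    Assumption for this sketch: the two haplotypes are independent, so each
    copy carries the alternate allele with probability `af`.
    """
    genome = []
    for v in known:
        copies = sum(rng.random() < v["af"] for _ in range(2))
        if copies:
            genome.append((v["chrom"], v["pos"], v["ref"], v["alt"], copies))
    return genome


def novel_snv_alt(ref, titv, rng):
    """Choose an alternate base for a novel SNV.

    A Ti/Tv ratio of `titv` means transitions occur with probability
    titv / (titv + 1); otherwise one of the two transversions is picked.
    """
    if rng.random() < titv / (titv + 1.0):
        return TRANSITION[ref]
    transversions = [b for b in "ACGT" if b != ref and b != TRANSITION[ref]]
    return rng.choice(transversions)


# Hypothetical input records, not taken from any real variant database.
rng = random.Random(42)
known = [
    {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "G", "af": 0.35},
    {"chrom": "chr1", "pos": 2500, "ref": "C", "alt": "A", "af": 0.02},
]
print(sample_known_variants(known, rng))
print(novel_snv_alt("A", 2.0, rng))
```

With a Ti/Tv ratio of 2.0, about two thirds of the novel SNVs generated this way are transitions.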
Summary
Although genome sequencing has become increasingly popular, the simulation of individual genomes is still important. Sequencing a large number of individual genomes is costly, and genome data with extreme and boundary conditions, such as fatal genetic defects, are difficult to obtain. There is a lack of tools that leverage known data to simulate personal genomes as realistically as possible. While ensuring that the simulated personal genome is sufficiently random, PGsim keeps the generated variants realistic and reliable in terms of variant distribution, proportion, and population characteristics.