Abstract

Many applications of high-throughput sequencing rely on the availability of an accurate reference genome. Variant calling often produces large data sets that cannot realistically be validated and that may contain large numbers of false positives. Errors in the reference assembly increase the number of false positives. While resources are available to aid in the filtering of variants from human data, these do not yet exist for other species, so strict filtering techniques must be employed, which are more likely to exclude true positives. This work assesses the accuracy of the pig reference genome (Sscrofa10.2) using whole genome sequencing (WGS) reads from the Duroc sow on whose genome the assembly was based. Indicators of structural variation, including high regional coverage, unexpected insert sizes, improper pairing and homozygous variants, were used to identify low quality (LQ) regions of the assembly. Low coverage (LC) regions were also identified and analyzed separately. The LQ regions covered 13.85% of the genome, the LC regions covered 26.6% of the genome and combined (LQLC) they covered 33.07% of the genome. Over half of dbSNP variants were located in the LQLC regions. Of copy number variable regions identified in a previous study, 86.3% were located in the LQLC regions. The regions were also enriched for gene predictions from RNA-seq data, with 42.98% falling in the LQLC regions. Excluding variants in the LQ, LC, or LQLC regions from future analyses will help reduce the number of false-positive variant calls. Researchers using WGS data should be aware that the current pig reference genome does not give an accurate representation of the copy number of alleles in the original Duroc sow’s genome.
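
The exclusion step recommended above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`merge_intervals`, `build_mask`, `is_masked`), the 0-based half-open coordinate convention, and the toy coordinates are all assumptions made for illustration. The sketch merges LQ and LC intervals into a combined LQLC mask and drops variant positions that fall inside it.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_mask(lq_regions, lc_regions):
    """Union of LQ and LC regions per chromosome (the combined LQLC mask)."""
    by_chrom = {}
    for chrom, start, end in list(lq_regions) + list(lc_regions):
        by_chrom.setdefault(chrom, []).append((start, end))
    return {chrom: merge_intervals(iv) for chrom, iv in by_chrom.items()}


def is_masked(mask, chrom, pos):
    """True if the 0-based position lies inside a masked interval."""
    merged = mask.get(chrom, [])
    # Index of the last interval whose start is <= pos, or -1 if none.
    i = bisect_right(merged, (pos, float("inf"))) - 1
    return i >= 0 and pos < merged[i][1]


# Toy data (hypothetical coordinates, not taken from Sscrofa10.2).
lq = [("chr1", 100, 200), ("chr1", 500, 600)]
lc = [("chr1", 150, 300), ("chr2", 0, 50)]
variants = [("chr1", 120), ("chr1", 350), ("chr1", 599),
            ("chr1", 600), ("chr2", 10), ("chr3", 5)]

mask = build_mask(lq, lc)
kept = [v for v in variants if not is_masked(mask, *v)]
masked_bp = sum(end - start for iv in mask.values() for start, end in iv)
```

Because overlapping LQ and LC intervals are merged before counting, the combined mask is smaller than the sum of its parts. The same effect is visible in the reported figures: 13.85% + 26.6% exceeds the combined 33.07%, which implies that roughly 7.38% of the genome is flagged as both LQ and LC.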

Highlights

  • Contemporary genetics research benefits from genomics tools and resources, including DNA sequencing and single nucleotide polymorphism (SNP) chips, which facilitate detailed quantitative molecular characterization of genetic variation at the population and individual level

  • This work emphasizes the importance of accuracy in reference genomes in variant discovery research

  • We are able to assess the assembly without introducing potential true variation that may be present by chance in multiple individuals; regions of the genome may have been incorrectly identified as low-quality due to true structural variation at heterozygous sites

Introduction

Contemporary genetics research benefits from genomics tools and resources, including DNA sequencing and single nucleotide polymorphism (SNP) chips, which facilitate detailed quantitative molecular characterization of genetic variation at the population and individual level. A high-quality reference genome sequence for the species of interest is an invaluable asset for the discovery of molecular genetic variants. [...] copy number variation (CNV) within species, such individual reference genomes do not contain all the sequences present in the species of interest. There are two major flaws in the current single linear model for reference genomes as a framework for discovery and analysis of genetic variation: (1) errors and gaps in the reference genome assemblies, most of which are incomplete drafts; and (2) the use of a haploid genome of one individual to represent the genome(s) of a species.