Abstract

SNP (single nucleotide polymorphism) discovery using next-generation sequencing data remains difficult primarily because of redundant genomic regions, such as interspersed repetitive elements and paralogous genes, present in all eukaryotic genomes. To address this problem, we developed Sniper, a novel multi-locus Bayesian probabilistic model and a computationally efficient algorithm that explicitly incorporates sequence reads that map to multiple genomic loci. Our model fully accounts for sequencing error, template bias, and multi-locus SNP combinations, maintaining high sensitivity and specificity under a broad range of conditions. An implementation of Sniper is freely available at http://kim.bio.upenn.edu/software/sniper.shtml.

Highlights

  • The advent of next-generation, short-read sequencing (NGS) technologies has enabled large-scale, whole-genome resequencing studies that aim to discover novel single nucleotide polymorphism (SNP) and other population genetic variations

  • Previous genome resequencing efforts have developed a variety of approaches to identify SNPs, including straightforward decision rules such as minimum coverage and quality cutoffs along with filters that mask reads aligning to repetitive genomic templates [2]; Bayesian algorithms that explicitly model sequencing chemistry and take full advantage of read-specific quality scores [3,4]; unsupervised [5] and supervised [6,7] machinelearning algorithms trained to distinguish sequencing errors from SNPs; and an alignment method that performs read mapping using all four nucleotide probabilities per-locus instead of the most probable call [8]

  • A SNP occurring within a repetitive sequence may be identified from overlapping reads that are anchored by unique flanking template, accurate mapping may be impossible if the length of the repetitive sequence is greater than the length of the read

Read more

Summary

Introduction

The advent of next-generation, short-read sequencing (NGS) technologies has enabled large-scale, whole-genome resequencing studies that aim to discover novel SNPs and other population genetic variations. Previous genome resequencing efforts have developed a variety of approaches to identify SNPs, including straightforward decision rules such as minimum coverage and quality cutoffs along with filters that mask reads aligning to repetitive genomic templates [2]; Bayesian algorithms that explicitly model sequencing chemistry and take full advantage of read-specific quality scores [3,4]; unsupervised [5] and supervised [6,7] machinelearning algorithms trained to distinguish sequencing errors from SNPs; and an alignment method that performs read mapping using all four nucleotide probabilities per-locus instead of the most probable call [8] These tools have successfully predicted many novel SNPs, genomes themselves contain inherent degeneracy due to redundant paralogous sequences and low complexity repetitive elements, while NGS data exhibit non-negligible sequencing errors and severe. SNPs occurring in redundant sequence contexts may be missed

Methods
Results
Discussion
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.