Abstract

Background: As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data.

Results: By computer simulations, we compare two methods of data acquisition: sequencing each diploid individual separately and sequencing the pooled sample. Under the current NGS error rate, sequencing each individual separately offers little advantage unless the coverage per individual is high (>20X). We hence propose a new method for estimating θ from pooled samples that have been subjected to two separate rounds of DNA sequencing. Since errors from the two sequencing applications are usually non-overlapping, it is possible to separate low frequency polymorphisms from sequencing errors. Simulation results show that the dual applications method is reliable even when the error rate is high and θ is low.

Conclusions: In studies of natural populations where the sequencing coverage is usually modest (~2X per individual), the dual applications method on pooled samples should be a reasonable choice.
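The idea that errors from two sequencing applications rarely overlap can be illustrated with a small sketch. This is not the authors' estimator of θ, only a toy filter under stated assumptions: each run is summarised as a mapping from site position to non-reference allele read counts, and a variant is kept only when the same allele appears at the same site in both runs. The function name `dual_application_sites`, the `min_reads` parameter, and the example counts are all hypothetical.

```python
def dual_application_sites(run_a, run_b, min_reads=1):
    """Keep variant alleles that are observed in both sequencing runs.

    run_a, run_b: dict mapping site position -> {non-reference allele: read count}.
    min_reads: hypothetical threshold on reads per run (an assumption, not from the paper).
    Returns a dict mapping site position -> sorted list of alleles seen in both runs.
    """
    confirmed = {}
    for site, alleles_a in run_a.items():
        alleles_b = run_b.get(site, {})
        shared = {
            allele
            for allele, count in alleles_a.items()
            if count >= min_reads and alleles_b.get(allele, 0) >= min_reads
        }
        if shared:
            confirmed[site] = sorted(shared)
    return confirmed


# Made-up counts: site 101 carries the same variant in both runs;
# the other sites differ between runs and are treated as likely errors.
run_a = {101: {"T": 2}, 250: {"G": 1}, 377: {"A": 1}}
run_b = {101: {"T": 1}, 250: {"C": 1}, 512: {"T": 1}}

confirmed = dual_application_sites(run_a, run_b)
print(confirmed)
```

If errors in the two runs were independent and each hit a given site with a small probability, the chance of the same spurious variant appearing in both would be roughly the square of that probability, which is why the intersection is far cleaner than either run alone. Independence is an assumption here; the source only says the errors are usually non-overlapping.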

Highlights

  • As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from such data

  • As polymorphism in natural populations is dominated by low frequency variants [9], which are often indistinguishable from sequencing errors, using the new sequencing technologies to estimate polymorphism will remain a challenge in the near future

  • To keep the false positive rate below 10%, most next generation sequencing platforms need more than 20X depth [3]


Summary

Introduction

As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data. θ is the number of nucleotide differences between two sequences of the same locus, randomly chosen from the population. It is a good measure of genetic diversity and a basic parameter in population genetic analysis (e.g. tests of positive selection [6,7,8]). As polymorphism in natural populations is dominated by low frequency variants [9], which are often indistinguishable from sequencing errors, using the new sequencing technologies to estimate polymorphism will remain a challenge in the near future. Since error signals may vary from operation to operation, its general applicability will need to be evaluated.
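The definition of θ as the number of nucleotide differences between two randomly chosen sequences can be made concrete by averaging over all pairs in a sample. The sketch below assumes error-free, aligned sequences of equal length, which is exactly the condition NGS data fail to meet; the function name `mean_pairwise_differences` and the toy sequences are illustrative and not taken from the paper.

```python
from itertools import combinations


def mean_pairwise_differences(sequences):
    """Average number of differing sites over all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for each pair, then average over pairs.
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


# Made-up sample of four aligned sequences from one locus.
sample = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]

per_locus = mean_pairwise_differences(sample)
per_site = per_locus / len(sample[0])
print(per_locus, per_site)
```

With real reads, a single miscalled base in one sequence adds to the count just as a true low frequency variant does, which is why the estimate is inflated when errors are not separated from polymorphisms.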

