Estimating DNA polymorphism from next generation sequencing data with high error rate by dual sequencing applications

Ziwen He,Xinnian Li,Yun-Xin Fu,Eric Hungate,Suhua Shi,Chung-I Wu,Shaoping Ling

doi:10.1186/1471-2164-14-535

Abstract

Background: As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data.

Results: By computer simulations, we compare the two methods of data acquisition - sequencing each diploid individual separately and sequencing the pooled sample. Under the current NGS error rate, sequencing each individual separately offers little advantage unless the coverage per individual is high (>20X). We hence propose a new method for estimating θ from pooled samples that have been subjected to two separate rounds of DNA sequencing. Since errors from the two sequencing applications are usually non-overlapping, it is possible to separate low frequency polymorphisms from sequencing errors. Simulation results show that the dual applications method is reliable even when the error rate is high and θ is low.

Conclusions: In studies of natural populations where the sequencing coverage is usually modest (~2X per individual), the dual applications method on pooled samples should be a reasonable choice.
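The intuition behind using two sequencing applications can be illustrated with a toy sketch. It is not the authors' estimator: it only shows why a variant seen in both applications is more credible than one seen in a single application. The function name `shared_variant_sites` and the example calls are hypothetical, and the sketch assumes errors arise independently in the two applications.

```python
def shared_variant_sites(run_a, run_b):
    """Keep sites whose variant allele was observed in both sequencing applications.

    run_a, run_b: dicts mapping a site position to the variant allele called there.
    (Hypothetical input format, chosen only for illustration.)
    """
    return {site: allele for site, allele in run_a.items()
            if run_b.get(site) == allele}


# Hypothetical variant calls from two applications on the same pooled sample.
run_a = {101: "T", 250: "G", 977: "A"}  # site 250 appears only in application A
run_b = {101: "T", 640: "C", 977: "A"}  # site 640 appears only in application B

confirmed = shared_variant_sites(run_a, run_b)
print(confirmed)
```

Under the independence assumption, if a given site is miscalled with probability e in each application, the chance that it is miscalled in both is on the order of e^2, which is why application-specific calls such as sites 250 and 640 above are treated as likely errors.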

Highlights

As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data
As polymorphism in natural populations is dominated by low frequency variants [9], which are often indistinguishable from sequencing errors, using the new sequencing technologies to estimate polymorphism will remain a challenge in the near future
To keep the false positive rate below 10%, most next generation sequencing platforms need more than 20X depth [3]

Summary

Introduction

As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data. Here θ is the expected number of nucleotide differences between two sequences of the same locus chosen at random from the population. It is a good measure of genetic diversity and a basic parameter in population genetic analysis (e.g., tests of positive selection [6,7,8]). As polymorphism in natural populations is dominated by low frequency variants [9], which are often indistinguishable from sequencing errors, using the new sequencing technologies to estimate polymorphism will remain a challenge in the near future. Since error signals may vary from operation to operation, its general applicability will need to be evaluated.
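The definition of θ as the number of differences between two randomly chosen sequences can be made concrete with a minimal sketch that averages the pairwise differences over a sample of aligned, error-free sequences. The function name `mean_pairwise_differences` and the toy alignment are hypothetical and do not come from the paper. The sketch ignores sequencing error entirely, which is exactly the complication the paper addresses.

```python
from itertools import combinations


def mean_pairwise_differences(sequences):
    """Average number of differing sites over all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for each pair, then average over pairs.
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


# Hypothetical toy alignment of four sequences at one locus (8 sites).
sample = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]

per_locus = mean_pairwise_differences(sample)  # differences per locus
per_site = per_locus / len(sample[0])          # differences per site
print(per_locus, per_site)
```

In this toy sample the six pairs differ at 1, 1, 0, 2, 1 and 1 sites, so the average is 1.0 difference per locus, or 0.125 per site. With real NGS reads, a single miscalled base would inflate this count in the same way a rare true variant does.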

Methods

Results

Discussion

Conclusion