Abstract

Background: A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima’s D. These statistics are now often applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology-dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions.

Results: We have developed an approach that accommodates the uncertainty of the data when calculating site frequency based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth and missing data, and avoids the need to infer variable sites for the analysis, thereby avoiding the ascertainment problems introduced by a SNP discovery process.

Conclusion: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 Genomes Project. The method is implemented in a fast framework which enables researchers to perform these neutrality tests on a genome-wide scale.

Highlights

  • A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima’s D

  • The effect of genotype calling for low or medium coverage data: to evaluate the performance of the estimators, we simulated multiple genomic regions both without selection and with strong positive selection

  • In this paper we show through simulations that estimating neutrality test statistics using called genotypes can lead to highly biased results

Summary

Introduction

A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima’s D. A commonly used approach for detecting selection is to use a neutrality test statistic based on allele frequencies, with Tajima’s D being the most famous. Most genotype callers are relatively conservative and only call a site heterozygous if there is substantial evidence that it is heterozygous. Such methods will tend to underestimate the frequency of the minor allele. An attempt to alleviate this problem has been made by only calling a genotype when there is substantial statistical evidence supporting the genotype call. Such approaches will generate a considerable amount of missing data, which leads to biases if not adequately dealt with [20]. This contrasts with the methods of [19,24], which can use quality scores for SNP discovery and incorporate the quality scores into the parameter estimation directly.
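To make the connection between the frequency spectrum and the test statistic concrete, the following is a minimal sketch of the classical Tajima’s D computed from an unfolded site frequency spectrum of known allele counts. It is not the empirical Bayes method described in this paper, which works from uncertain NGS data rather than called counts. The function name `tajimas_d` and the toy spectra are illustrative assumptions.

```python
import math

def tajimas_d(sfs):
    """Classical Tajima's D from an unfolded site frequency spectrum.

    sfs[i] is the number of sites at which the derived allele is seen
    i times among n sampled chromosomes, so len(sfs) == n + 1.
    Entries 0 and n (non-segregating sites) are ignored.
    """
    n = len(sfs) - 1
    if n < 4:
        # For n <= 3 the variance term is identically zero.
        raise ValueError("need at least 4 chromosomes")

    seg = sfs[1:n]                      # counts for i = 1 .. n-1
    S = sum(seg)                        # number of segregating sites
    if S == 0:
        return float("nan")

    # Mean number of pairwise differences (theta_pi).
    pi = sum(c * 2.0 * i * (n - i) for i, c in enumerate(seg, start=1))
    pi /= n * (n - 1)

    # Constants from the standard definition of the statistic.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    var = e1 * S + e2 * S * (S - 1)
    if var <= 0:
        return float("nan")
    return (pi - S / a1) / math.sqrt(var)

# Toy spectrum with counts proportional to 1/i (the neutral expectation)
# for n = 10 chromosomes; 2520 is divisible by every i in 1..9.
neutral_sfs = [0] + [2520 // i for i in range(1, 10)] + [0]
d_neutral = tajimas_d(neutral_sfs)

# Hypothetical illustration of conservative calling: half of the
# singletons are missed, so rare alleles are under-represented.
undercalled_sfs = list(neutral_sfs)
undercalled_sfs[1] //= 2
d_undercalled = tajimas_d(undercalled_sfs)
```

In this toy setting the spectrum proportional to 1/i gives a value of D at zero up to rounding, while losing singletons pushes D upward. This is the direction of bias one would expect when rare heterozygotes are under-called.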
