Abstract

Although pooled-population sequencing has become a widely used approach for estimating allele frequencies, most work has proceeded in the absence of a proper statistical framework. We introduce a self-sufficient, closed-form, maximum-likelihood estimator for allele frequencies that accounts for errors associated with sequencing, and a likelihood-ratio test statistic that provides a simple means for evaluating the null hypothesis of monomorphism. Unbiased estimates of allele frequencies (where N is the number of individuals sampled) appear to be unachievable, and near-certain identification of a polymorphism requires a minor-allele frequency . A framework is provided for testing for significant differences in allele frequencies between populations, taking into account sampling at the levels of individuals within populations and sequences within pooled samples. Analyses that fail to account for the two tiers of sampling suffer from very large false-positive rates and can become increasingly misleading with increasing depths of sequence coverage. The power to detect significant allele-frequency differences between two populations is very limited unless both the number of sampled individuals and depth of sequencing coverage exceed 100.
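The claim that ignoring the two tiers of sampling becomes more misleading as coverage grows can be illustrated with a minimal sketch. It is not the framework of the paper: it assumes diploid individuals contributing equally to the pool, reads drawn with replacement, and no sequencing error, and the function names are hypothetical. Under those assumptions the law of total variance gives the sampling variance of a pooled allele-frequency estimate, which can be compared with the variance obtained by treating reads as the only source of sampling noise.

```python
def naive_variance(p, n_reads):
    """Variance if reads were the only tier of sampling (binomial over reads)."""
    return p * (1.0 - p) / n_reads


def two_tier_variance(p, n_individuals, n_reads):
    """Variance with binomial sampling of 2N gene copies, then of reads.

    Assumes equal contributions to the pool and no sequencing error.
    """
    chromosomes = 2 * n_individuals
    # Tier 1: sampling 2N gene copies from the population.
    var_individuals = p * (1.0 - p) / chromosomes
    # Tier 2: expected read-sampling variance, E[x(1 - x)] / n,
    # where x is the allele frequency in the pool.
    var_reads = p * (1.0 - p) * (1.0 - 1.0 / chromosomes) / n_reads
    return var_individuals + var_reads


def variance_ratio(p, n_individuals, n_reads):
    """How many times the naive variance understates the two-tier variance."""
    return two_tier_variance(p, n_individuals, n_reads) / naive_variance(p, n_reads)


if __name__ == "__main__":
    for coverage in (50, 100, 1000):
        print(coverage, variance_ratio(0.3, 50, coverage))
```

With the number of individuals held fixed, the ratio grows roughly in proportion to coverage, so a test built on the naive variance understates the true uncertainty more severely at higher depth.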

Highlights

  • An increasingly popular approach to characterizing the genetic variation in a population involves pooling DNA from a large number of individuals into one sample from which a single DNA library is extracted

  • Within certain constraints, pooled sampling has a number of potentially useful applications, for example, discovering single-nucleotide polymorphisms (SNPs), ascertaining the site-frequency spectrum within a population, determining patterns of variation at various classes of sites, and evaluating the amount of genetic differentiation among populations (Van Tassell et al 2008; Futschik and Schlotterer 2010; Kofler et al 2011; Boitard et al 2012, 2013; Chubiz et al 2012; Lamichhaney et al 2012; Zhu et al 2012; Gautier et al 2013; Navon et al 2013; Konczal et al 2014; Lieberman et al 2014)

  • The typical approach is to rely on arbitrary coverage cutoffs in inferring the validity of an SNP at a particular site, with the contributions from sequencing errors being dealt with in arbitrary or undisclosed ways

Introduction

An increasingly popular approach to characterizing the genetic variation in a population involves pooling DNA from a large number of individuals into one sample from which a single DNA library is extracted.

$$\sum_{i=0} \ldots \, (1 - \hat{p}_M)^{2N-i} \, M_i^{n_M} \, m_i^{n_m}, \qquad (7)$$

where $\hat{p}_M$ is the ML estimate of the major-allele frequency in the sample; $n_M$ and $n_m$ are the numbers of counts for major and minor alleles in the sample, respectively; and $M_i$ and $m_i$ are defined as in equations (1a) and (1b) with $i/(2N)$ substituted for $p$. This expression approximates the total likelihood for a set of reads by summing over the probabilities of all possible samplings of the alleles from the population and accounting for the probability of the observed quartet given the sample.

To test for the significance of an allele-frequency difference between two samples, we first require the joint likelihoods of the observed reads in both samples, starting with the assumption of population homogeneity. For such purposes, we start with summed quartets over both populations to obtain an estimate of the total major-allele frequency $\hat{p}_T$ using equation (3b).

Although the number of changes exceeded that expected after accounting for the contribution from genetic drift, no single variant exhibited a significant change.
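The summation described for equation (7) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: equations (1a) and (1b) are not reproduced here, so the per-read probabilities in `read_probabilities` (an error rate `eps` spread evenly over the three alternative nucleotides) are assumed forms, the weight on each pool composition is assumed to be binomial in the 2N sampled gene copies, and any combinatorial constant and the counts of the two remaining nucleotides in the quartet are omitted. All function and parameter names are hypothetical.

```python
from math import comb


def read_probabilities(p, eps):
    """Assumed probabilities that a read shows the major / minor nucleotide.

    p is the major-allele frequency in the pool, eps the per-read error rate.
    These forms stand in for equations (1a) and (1b), which are not shown here.
    """
    major = p * (1.0 - eps) + (1.0 - p) * eps / 3.0
    minor = (1.0 - p) * (1.0 - eps) + p * eps / 3.0
    return major, minor


def pooled_likelihood(p_hat, n_major, n_minor, n_individuals, eps):
    """Sum over all possible numbers i of major alleles among 2N gene copies."""
    chromosomes = 2 * n_individuals
    total = 0.0
    for i in range(chromosomes + 1):
        # Assumed binomial probability that the pool carries i major alleles.
        weight = comb(chromosomes, i) * p_hat**i * (1.0 - p_hat) ** (chromosomes - i)
        # Read probabilities with i / (2N) substituted for p.
        major_i, minor_i = read_probabilities(i / chromosomes, eps)
        total += weight * major_i**n_major * minor_i**n_minor
    return total


if __name__ == "__main__":
    # 18 major and 2 minor reads from a pool of 10 diploid individuals.
    for p_hat in (0.5, 0.7, 0.9):
        print(p_hat, pooled_likelihood(p_hat, 18, 2, 10, 0.01))
```

For realistic read counts the products become very small, so a practical version would work with log-likelihoods; the direct form is kept here only because it mirrors the structure of the sum.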
