Abstract
Viruses exist within hosts at large population sizes and are subject to high rates of mutation. As such, viral populations exhibit considerable sequence diversity. A variety of summary statistics have been developed which describe, in a single number, the extent of diversity in a viral population; such measurements allow the diversities of different populations to be compared, and the effect of evolutionary forces on a population to be assessed. Here we highlight statistical artefacts underlying some common measures of sequence diversity, whereby variation in the depth of genome sequencing may substantially affect the extent of diversity measured in a viral population, making comparisons of population diversity invalid. Specifically, naive estimation of sequence entropy provides a systematically biased metric, a lower read depth being expected to produce a lower estimate of diversity. The number of polymorphic loci per kilobase of genome is more unpredictably affected by read depth, giving potentially flawed results at lower sequencing depths. We show that the nucleotide diversity statistic π provides an unbiased estimate of diversity in the sense that the expected value of the statistic is equal to the correct value of the property being measured. Our results are of importance for studies interpreting genome sequence data; we describe how diversity may be assessed in viral populations in a fair and unbiased manner.
Highlights
Many viruses form large within-host populations and evolve under the influence of high mutation rates
We highlight statistical artefacts underlying some common measures of sequence diversity, whereby variation in the depth of genome sequencing may substantially affect the extent of diversity measured in a viral population, making comparisons of population diversity invalid
While sequence diversity is a complex property, there exist a range of statistical measures of diversity, each capturing the diversity of a population in a single numerical value
Summary
Many viruses form large within-host populations and evolve under the influence of high mutation rates. Within-host viral populations may contain a large amount of sequence diversity (Lauring, Frydman, and Andino 2013). While sequence diversity is a complex property, there exist a range of statistical measures of diversity, each capturing the diversity of a population in a single numerical value. Such measures, which include the number of polymorphisms per thousand bases, sequence entropy, and the population genetics parameter π, allow for the simple evaluation of changes in population diversity. Increases and decreases in diversity may be measured over time (Gall et al. 2013; Maldarelli et al. 2013).
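The contrast between sequence entropy and π under varying read depth can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes a single genome position, reads drawn independently from fixed true allele frequencies (multinomial sampling, no sequencing error), entropy measured in nats, and π computed per site as the probability that two distinct reads differ. All function names and the example frequencies are hypothetical.

```python
import math
import random


def site_entropy(counts):
    """Naive (plug-in) Shannon entropy in nats from allele read counts at one site."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def site_pi(counts):
    """Per-site nucleotide diversity: chance that two distinct reads differ."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two reads to compare a pair")
    # Dividing by n * (n - 1) rather than n * n removes the sampling bias.
    return (n * n - sum(c * c for c in counts)) / (n * (n - 1))


def sample_counts(freqs, depth, rng):
    """Draw `depth` reads from the true allele frequencies (assumed multinomial)."""
    counts = [0] * len(freqs)
    for allele in rng.choices(range(len(freqs)), weights=freqs, k=depth):
        counts[allele] += 1
    return counts


def mean_estimates(freqs, depth, replicates=5000, seed=1):
    """Average both statistics over many simulated sequencing runs."""
    rng = random.Random(seed)
    h_total = pi_total = 0.0
    for _ in range(replicates):
        counts = sample_counts(freqs, depth, rng)
        h_total += site_entropy(counts)
        pi_total += site_pi(counts)
    return h_total / replicates, pi_total / replicates


# Hypothetical site: 90% major allele, 10% minor allele (order A, C, G, T).
true_freqs = [0.9, 0.1, 0.0, 0.0]
true_h = -sum(f * math.log(f) for f in true_freqs if f > 0)
true_pi = 1.0 - sum(f * f for f in true_freqs)

print(f"true values: entropy={true_h:.4f} pi={true_pi:.4f}")
for depth in (10, 100, 1000):
    h_mean, pi_mean = mean_estimates(true_freqs, depth)
    print(f"depth={depth}: mean entropy={h_mean:.4f} mean pi={pi_mean:.4f}")
```

Under these assumptions the averaged naive entropy should fall below the true value at low depth and approach it as depth grows, while the averaged π should stay close to the true value at every depth, which is the sense of "unbiased" used in the abstract.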