Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes.

Peter Ralph, Kevin Thornton, Jerome Kelleher

doi:10.1534/genetics.120.303253

Open Access

https://doi.org/10.1534/genetics.120.303253

Abstract

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation; and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
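The weight-and-summary-function idea described in the abstract can be illustrated on a single tree. The following is a minimal sketch, not the authors' implementation: the toy tree, the names `PARENT`, `TIME`, `accumulate_weights`, `branch_stat`, and `diversity_summary` are all hypothetical, and the summary function shown (the fraction of sample pairs that a branch separates) is one assumed choice that yields a branch-mode diversity. A brute-force pairwise calculation is included as a check.

```python
from itertools import combinations

# Hypothetical toy tree (not from the paper): child -> parent, plus node times.
PARENT = {0: 4, 1: 4, 4: 5, 2: 5, 5: 6, 3: 6}
TIME = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: 3.0}
SAMPLES = [0, 1, 2, 3]


def accumulate_weights(parent, time, sample_weights):
    """Sum the sample weights found in the subtree below each node."""
    total = {u: 0.0 for u in time}
    for u, w in sample_weights.items():
        total[u] += w
    # Visit nodes youngest first, so a node's total is complete
    # before it is passed up to its parent.
    for u in sorted(time, key=time.get):
        if u in parent:
            total[parent[u]] += total[u]
    return total


def branch_stat(parent, time, sample_weights, summary):
    """Sum of branch length times summary(weight below the branch)."""
    total = accumulate_weights(parent, time, sample_weights)
    return sum((time[p] - time[c]) * summary(total[c]) for c, p in parent.items())


def diversity_summary(n):
    """Assumed summary function: fraction of the n-choose-2 pairs a branch separates."""
    pairs = n * (n - 1) / 2
    return lambda x: x * (n - x) / pairs


def pairwise_branch_diversity(parent, time, samples):
    """Brute-force check: mean branch length separating each pair of samples."""
    def path_to_root(u):
        path = [u]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    pairs = list(combinations(samples, 2))
    total = 0.0
    for a, b in pairs:
        ancestors_a = set(path_to_root(a))
        mrca = next(v for v in path_to_root(b) if v in ancestors_a)
        total += (time[mrca] - time[a]) + (time[mrca] - time[b])
    return total / len(pairs)


weights = {u: 1.0 for u in SAMPLES}
fast = branch_stat(PARENT, TIME, weights, diversity_summary(len(SAMPLES)))
slow = pairwise_branch_diversity(PARENT, TIME, SAMPLES)
print(fast, slow)
```

On this toy tree both routes give 14/3. Swapping in a different weight assignment or summary function would, in the spirit of the abstract, give a different statistic without changing the accumulation step.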

Highlights

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site
We study single-site genetic statistics, i.e., statistics of aligned genome sequence that can be expressed as averages over values computed separately for each site
We develop a theoretical and computational framework that encompasses a large class of population genetic statistics, generalizing many classical summaries of genetic variation
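To make the notion of a single-site statistic concrete, here is a small sketch that computes nucleotide diversity as an average of per-site values and checks it against a direct count of pairwise differences. The genotype matrix, the sequence length, and the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical data: each row is one variable site, each column one sampled genome
# (0 = ancestral allele, 1 = derived allele).
GENOTYPES = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
]
SEQUENCE_LENGTH = 10  # assumed; the 7 invariant sites contribute zero


def per_site_diversity(site_alleles):
    """Fraction of sample pairs that differ at one biallelic site."""
    n = len(site_alleles)
    x = sum(site_alleles)  # number of derived alleles
    return x * (n - x) / (n * (n - 1) / 2)


def site_diversity(genotypes, length):
    """Single-site statistic: per-site values averaged over all sites."""
    return sum(per_site_diversity(row) for row in genotypes) / length


def pairwise_diversity(genotypes, length):
    """Direct check: mean number of differences per pair, per site."""
    n = len(genotypes[0])
    pairs = list(combinations(range(n), 2))
    diffs = sum(row[i] != row[j] for row in genotypes for i, j in pairs)
    return diffs / (len(pairs) * length)


by_site = site_diversity(GENOTYPES, SEQUENCE_LENGTH)
by_pairs = pairwise_diversity(GENOTYPES, SEQUENCE_LENGTH)
print(by_site, by_pairs)
```

Both calculations give 1/6 for this matrix, which is the sense in which the statistic "can be expressed as averages over values computed separately for each site".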

Summary

The first term is the variance of the expected site statistic given the trees, which by duality is the variance of the branch statistic, i.e., the contribution of demographic noise, including genetic drift.

Top: diversity within the entire population, computed as a site statistic from 20 independent assignments of mutations to the same tree sequence with mutation rate μ = 10⁻⁹.

The last of these plots can be observed directly in real data: in the top two plots, the spread of independent replicates of mutational noise (black lines) about their expectation based on the tree sequence (red line) is unobservable, estimable ...

As another example, we simulated an admixture scenario: a first population of size N = 1000 splits into two of equal size; after N generations, the second population splits; after another N generations, the third population splits again; and for the final N generations, populations 2 and 3 have per-capita migration rates of 4/N to each other.

In practice, imperfect estimation of the tree sequence would introduce additional noise, so it remains to be seen whether the improvement can be realized.
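The equation whose "first term" this summary mentions did not survive in the text. As a hedged sketch of the structure the sentence describes, the law of total variance, conditioning on the trees $T$, splits the variance of a site statistic into two parts. The notation ($\mathrm{Site}$, $T$) is assumed here, and this is the generic identity, not necessarily the paper's exact formula or scaling:

```latex
\operatorname{Var}[\mathrm{Site}]
  = \underbrace{\operatorname{Var}\bigl[\mathbb{E}[\mathrm{Site} \mid T]\bigr]}_{\text{variation among genealogies}}
  + \underbrace{\mathbb{E}\bigl[\operatorname{Var}[\mathrm{Site} \mid T]\bigr]}_{\text{mutational noise given the genealogy}}
```

Read this way, the first term is the part attributed to demographic noise, and the second plausibly corresponds to the spread of mutation replicates around the tree-based expectation.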

Data and Code Availability

Discussion

Literature Cited