Abstract
The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while much attention has been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a nonparametric approach for estimating the date of origin of genetic variants in large-scale sequencing data sets. The accuracy and robustness of the approach are demonstrated through simulation. Using data from two publicly available human genomic diversity resources, we estimated the age of more than 45 million single-nucleotide polymorphisms (SNPs) in the human genome and release the Atlas of Variant Age as a public online database. We characterize the relationship between variant age and frequency in different geographical regions and demonstrate the value of age information in interpreting variants of functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the ancestry shared between individual genomes and to quantify genealogical relationships at different points in the past, as well as to describe and explore the evolutionary history of modern human populations.
Highlights
Each generation, a human genome acquires an average of about 70 single-nucleotide changes through mutation in the germline of its parents [1].
We compared our approach for estimating the time to the most recent common ancestor (TMRCA) with the computationally more demanding pairwise sequentially Markovian coalescent (PSMC) methodology [13], which forms the basis of many applications in ancestral inference [14, 26].
Pairwise estimates of the TMRCA between concordant haplotypes were highly correlated with the true TMRCA in both Genealogical Estimation of Variant Age (GEVA) (ρ = 0.922) and PSMC (ρ = 0.919), but the correlation for discordant pairs was lower in GEVA (ρ = 0.586) than in PSMC (ρ = 0.766); see S1 Fig. Such differences are tolerable for estimating allele age accurately because the time of mutation is estimated from the composite distribution of TMRCA posteriors from many pairwise comparisons performed at a single locus.
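The idea of combining many pairwise TMRCA posteriors into one age estimate can be illustrated with a minimal sketch. It rests on assumptions not spelled out in this text: that a concordant pair (both haplotypes carry the allele) bounds the mutation age from below, that a discordant pair bounds it from above, and that posteriors are discretised on a shared time grid. The function names, the grid, and the Gaussian-shaped toy posteriors are all hypothetical, and this is not presented as the GEVA implementation.

```python
import math


def normalise(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def composite_age_posterior(concordant, discordant):
    """Combine pairwise TMRCA posteriors into a posterior over mutation age.

    Assumption (illustrative only): the mutation is at least as old as the
    TMRCA of each concordant pair and no older than the TMRCA of each
    discordant pair. Each input posterior is a list of weights on one grid.
    """
    n = len((concordant or discordant)[0])
    log_score = [0.0] * n
    tiny = 1e-300  # guards against log(0)

    for post in concordant:
        running = 0.0
        for i, p in enumerate(normalise(post)):
            running += p  # P(TMRCA <= grid[i])
            log_score[i] += math.log(max(running, tiny))

    for post in discordant:
        running = 0.0
        for i, p in enumerate(normalise(post)):
            # P(TMRCA >= grid[i]) = 1 - P(TMRCA < grid[i])
            log_score[i] += math.log(max(1.0 - running, tiny))
            running += p

    peak = max(log_score)  # subtract the peak before exponentiating
    return normalise([math.exp(s - peak) for s in log_score])


def bump(grid, centre, width):
    """Toy bell-shaped posterior; purely synthetic, not real data."""
    return [math.exp(-0.5 * ((t - centre) / width) ** 2) for t in grid]


grid = list(range(100, 1100, 100))  # time in generations (made-up scale)
concordant = [bump(grid, 300, 100), bump(grid, 350, 100)]
discordant = [bump(grid, 700, 100), bump(grid, 800, 100)]

posterior = composite_age_posterior(concordant, discordant)
mode_age = grid[posterior.index(max(posterior))]
print(mode_age)
```

Because every pair contributes only one factor to the product, a noisy posterior from a single discordant pair shifts the composite only slightly, which is one way to read why lower per-pair accuracy can be tolerated.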
Summary
A human genome acquires an average of about 70 single-nucleotide changes through mutation in the germline of its parents [1]. At a global scale, many millions of new variants are generated each year, but the vast majority are lost rapidly through genetic drift and purifying selection. Even though the majority of variants themselves are extremely rare, the majority of genetic differences between genomes result from variants found at global frequencies of 1% or more [2], which may have appeared thousands of generations ago. Genome sequencing studies [3] have catalogued the vast majority of common variation (estimated to be about 10 million variants [4]), and, at least within coding regions and particular ancestries, to date, more than 660 million variants genome-wide have been reported [5], many of them at extremely low frequency [2].