Abstract

We report the Simons Genome Diversity Project (SGDP) dataset: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioral modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that in other non-Africans.

Highlights

  • We report the Simons Genome Diversity Project (SGDP) dataset: high quality genomes from 300 individuals from 142 diverse populations

  • The SGDP dataset highlights the incompleteness of current catalogs of human variation, with the fraction of heterozygous positions not discovered by the 1000 Genomes Project being 11% in the KhoeSan and 5% in New Guineans and Australians (Extended Data Fig. 1; Supplementary Data Table 1)

  • We find that FermiKit has comparable sensitivity and specificity to the Genome Analysis Toolkit (GATK) for single nucleotide polymorphism (SNP) discovery and genotyping, and is more accurate for indels (Supplementary Information section 4)
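
The "fraction of heterozygous positions not discovered" statistic in the highlights can be illustrated with a small sketch. This is not the SGDP pipeline: the function name `novel_het_fraction`, the site encoding as (chromosome, position, alternate allele) tuples, and the toy data are all assumptions made for illustration.

```python
def novel_het_fraction(het_sites, catalog):
    """Return the fraction of heterozygous sites that are absent from a catalog.

    Both arguments are iterables of hashable site keys; the
    (chromosome, position, alt allele) encoding used below is an assumption.
    """
    het_sites = set(het_sites)
    if not het_sites:
        raise ValueError("no heterozygous sites supplied")
    novel = het_sites - set(catalog)
    return len(novel) / len(het_sites)


# Toy data only, not real SGDP or 1000 Genomes Project sites.
sample_hets = {("1", 1000, "A"), ("1", 2000, "T"), ("2", 500, "G"), ("2", 900, "C")}
known_catalog = {("1", 1000, "A"), ("2", 500, "G"), ("2", 900, "C"), ("3", 42, "T")}

fraction = novel_het_fraction(sample_hets, known_catalog)
print(f"{fraction:.0%} of heterozygous sites are absent from the catalog")
```

On real data the site keys would come from a sample's genotype calls and from the catalog's variant list, and the result depends on how multi-allelic sites and indels are normalized before comparison.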


Summary

Data set and catalog of novel variants

We sequenced the samples to an average coverage of 43-fold (range 34–83-fold) at Illumina Ltd.; almost all samples (278) were prepared using the same PCR-free library preparation[2]. At “filter level 1”, which we recommend for most analyses, we retain an average of 2.13 Gb of sequence per sample and identify 34.4 million single nucleotide polymorphisms (SNPs) and 2.1 million insertion/deletion polymorphisms (indels) (Supplementary Information section 2). We used FermiKit[5] to map short reads against each other, store the assemblies in a compressed form that retains all the information required for polymorphism discovery and analysis, and identify SNPs by comparing against the human reference. We find that FermiKit has comparable sensitivity and specificity to GATK for SNP discovery and genotyping, and is more accurate for indels (Supplementary Information section 4). FermiKit identified 5.8 Mb of contigs that are present in the SGDP but absent in the human reference genome, presumably because they are deleted there; these contigs, which we have made publicly available, can be used as “decoys” to improve read mapping (Supplementary Information section 5). The high quality of the short tandem repeat (STR) genotypes (r² = 0.92 to capillary sequencing calls) is evident from their accurate reconstruction of population relationships, even for difficult-to-genotype mononucleotide repeats (Extended Data Fig. 2).
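
As a rough illustration of a concordance figure like the r² quoted for the STR genotypes, the sketch below computes a squared Pearson correlation between two sets of calls. The text does not state how r² was computed, so treating it as the squared Pearson correlation of per-genotype repeat lengths is an assumption, as are the function name `r_squared` and the toy values.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical repeat-length calls (summed over both alleles) for six genotypes.
sequencing_calls = [24, 26, 30, 22, 28, 25]
capillary_calls = [24, 26, 29, 22, 28, 26]

concordance = r_squared(sequencing_calls, capillary_calls)
print(f"r^2 = {concordance:.3f}")
```

A value near 1 indicates that the sequencing-based calls track the capillary calls closely; the statistic alone does not reveal systematic offsets, so it is usually read alongside a scatter plot or an exact-match rate.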

The structure of human genetic diversity
The time course of human population separation