Abstract

We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
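The ~10% figure can be checked with simple arithmetic. The sketch below takes the contig totals from the abstract and assumes a GRCh38 length of roughly 3.1 Gb, a round figure that the abstract does not state. The function names are illustrative only.

```python
# Totals reported in the abstract.
NOVEL_BP = 296_485_284
NOVEL_CONTIGS = 125_715

# ASSUMPTION: approximate GRCh38 length (~3.1 Gb); not given in the abstract.
GRCH38_BP_APPROX = 3_100_000_000


def fraction_of_reference(novel_bp: int, reference_bp: int) -> float:
    """Return the novel sequence as a fraction of the reference length."""
    return novel_bp / reference_bp


def mean_contig_length(total_bp: int, n_contigs: int) -> float:
    """Return the average contig length in base pairs."""
    return total_bp / n_contigs


if __name__ == "__main__":
    frac = fraction_of_reference(NOVEL_BP, GRCH38_BP_APPROX)
    print(f"Novel sequence relative to reference: {frac:.1%}")
    print(f"Mean contig length: {mean_contig_length(NOVEL_BP, NOVEL_CONTIGS):,.0f} bp")
```

Under that assumed reference length, the ratio comes out slightly below 10%, consistent with the rounded figure in the abstract.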

Highlights

  • Despite these efforts, the current human reference genome derives primarily from a single individual[4], limiting its usefulness for genetic studies, especially among admixed populations, such as those representing the African diaspora

  • Findings from the 1000 Genomes Project indicate that differences between populations are quite large; examination of 26 populations across five continents revealed that 86% of discovered variants were present in only one continental group

  • Other groups have used highly homogeneous populations together with assembly-based approaches to discover SNPs and structural variants (SVs), including up to several megabases of non-reference sequence common to these populations[16,17,18,19]. Although these variant analyses are a step in the right direction, to date none have produced a reference-quality genome that can replace GRCh38, although this is an explicit goal of the Danish Genome Project (URLs)

Summary

Methods

We used whole-genome shotgun sequence data from 910 individuals whose genomes were sequenced as part of the CAAPA project, available from dbGaP as accession phs001123.v1.p1. If an unplaced contig aligned with ≥80% coverage and ≥90% identity, it was removed from the unplaced set, but it was not added to the placed cluster because it did not meet the stricter placement or containment criteria used to create the clusters. For the placed contigs, because we had already determined which individuals contained these sequences, the genotype matrix was supplemented by adding a presence call (“1”) if we had determined that an individual had a contig in the placement cluster. This additional calling increased sensitivity for individuals who had mate placement information available for the insertion, even when the contigs did not meet the identity/coverage criteria used for this presence/absence genotyping. Further information on research design is available in the Nature Research Reporting Summary linked to this article.
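The two rules described above, the ≥80% coverage and ≥90% identity filter on unplaced contigs and the presence-call supplement for placed contigs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the dictionary-based genotype matrix, and the individual and contig identifiers are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Thresholds quoted in the text.
MIN_COVERAGE = 0.80
MIN_IDENTITY = 0.90

# ASSUMPTION: genotypes are stored as {(individual, contig): 0 or 1}.
Genotypes = Dict[Tuple[str, str], int]


def is_redundant_unplaced(coverage: float, identity: float) -> bool:
    """True if an unplaced contig's alignment meets both removal thresholds."""
    return coverage >= MIN_COVERAGE and identity >= MIN_IDENTITY


def supplement_presence_calls(
    genotypes: Genotypes,
    placed_members: Iterable[Tuple[str, str]],
) -> Genotypes:
    """Return a copy of the matrix with a presence call (1) added for every
    (individual, contig) pair already known from the placement clusters."""
    updated = dict(genotypes)
    for individual, contig in placed_members:
        updated[(individual, contig)] = 1
    return updated


if __name__ == "__main__":
    print(is_redundant_unplaced(coverage=0.85, identity=0.93))
    calls = {("sample_A", "contig_1"): 0}
    print(supplement_presence_calls(calls, [("sample_A", "contig_1")]))
```

Returning a copy rather than mutating the input keeps the alignment-based calls available for comparison with the supplemented matrix.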
