Abstract
The current human reference genome is predominantly derived from a single individual and does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies from genetically divergent human populations to identify sequences missing from the human reference genome at breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4,781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.
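To make the idea of linearly integrating insertions into a chromosomal assembly concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each insertion is given as a 0-based reference position and an inserted sequence, and that the inserted sequence is placed immediately before the reference base at that position. The function names `integrate_insertions` and `make_liftover` are hypothetical and introduced only for illustration. The sketch also shows why a coordinate map is needed: every insertion shifts all downstream reference coordinates.

```python
from bisect import bisect_right
from itertools import accumulate


def integrate_insertions(reference, insertions):
    """Splice insertions into a linear reference sequence.

    insertions: list of (pos, seq) pairs; seq is placed immediately
    before the reference base at 0-based index pos (an assumption of
    this sketch).
    """
    pieces = []
    prev = 0
    for pos, seq in sorted(insertions):
        pieces.append(reference[prev:pos])  # untouched reference segment
        pieces.append(seq)                  # non-reference insertion
        prev = pos
    pieces.append(reference[prev:])         # remainder of the reference
    return "".join(pieces)


def make_liftover(insertions):
    """Return a function mapping original coordinates to extended ones."""
    ordered = sorted(insertions)
    positions = [pos for pos, _ in ordered]
    # Cumulative inserted length after each insertion, in position order.
    shifts = list(accumulate(len(seq) for _, seq in ordered))

    def lift(ref_pos):
        # Count insertions placed at or before this reference base.
        k = bisect_right(positions, ref_pos)
        return ref_pos + (shifts[k - 1] if k else 0)

    return lift


# Toy example with made-up sequences (not real genomic data).
reference = "ACGTACGT"
insertions = [(2, "TTT"), (6, "GG")]
extended = integrate_insertions(reference, insertions)
lift = make_liftover(insertions)
```

In this toy example every original base keeps its identity under the coordinate map, that is, `extended[lift(i)] == reference[i]` for each index `i`. Existing annotations could in principle be carried over to the extended sequence the same way.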
Highlights
The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity
In the Human Genome Project completed in 2003, technical limitations restricted sequencing to the euchromatic regions of the genome
The prohibitive cost and effort of sequencing made it impractical to generate more than one reference-quality human genome at that time
Summary
The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Some factors limit the utility of the human reference genome for genome analysis. First, it is a composite haplotype derived from a small number of donors recruited at one location in the US, with 70% of its sequences coming from a single DNA donor[1], and so it does not capture the diversity of the world’s population. Second, it does not contain numerous unique sequences found in multiple individuals but not in the human reference genome[2,3,4,5,6,7,8]. This work presents an important first step towards a human reference genome that represents the diversity of human populations.