Abstract

The current human reference genome is predominantly derived from a single individual and does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies from genetically divergent human populations to identify sequences missing from the human reference genome at breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4,781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.
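
As an illustration only, the idea of linearly integrating insertions at known breakpoints can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the authors' pipeline: the function names `integrate_insertions` and `lift_coordinate`, the 0-based "insert before this index" breakpoint convention, and the example sequences are all hypothetical.

```python
def integrate_insertions(ref, insertions):
    """Splice insertions into a reference string.

    `insertions` is a list of (breakpoint, sequence) pairs. Assumed
    convention: each sequence is placed immediately before the 0-based
    reference index `breakpoint`.
    """
    pieces = []
    prev = 0
    for pos, seq in sorted(insertions):
        pieces.append(ref[prev:pos])  # reference segment up to the breakpoint
        pieces.append(seq)            # the non-reference insertion
        prev = pos
    pieces.append(ref[prev:])         # remaining reference tail
    return "".join(pieces)


def lift_coordinate(pos, insertions):
    """Map a 0-based reference position to the augmented sequence."""
    # Every insertion at or before `pos` shifts it right by its length.
    shift = sum(len(seq) for bp, seq in insertions if bp <= pos)
    return pos + shift


# Toy example (made-up sequences, not real data)
ref = "ACGTACGT"
insertions = [(2, "TTT"), (6, "GG")]
augmented = integrate_insertions(ref, insertions)

# Each original base should be found at its lifted coordinate.
for i, base in enumerate(ref):
    assert augmented[lift_coordinate(i, insertions)] == base
```

The coordinate lift matters because annotations defined on the original reference must be shifted to remain valid once sequence has been added upstream of them.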

Highlights

  • The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity

  • In the Human Genome Project completed in 2003, technical limitations restricted sequencing to the euchromatic regions of the genome

  • The prohibitive cost and effort of sequencing made it impractical to generate more than one reference-quality human genome at that time

Summary

Introduction

The current human reference genome is predominantly derived from a single individual and does not adequately reflect human genetic diversity. Several factors limit its utility for genome analysis. It is a composite haplotype derived from a small number of donors recruited at one location in the US, with 70% of its sequence coming from a single DNA donor[1], so it does not capture the diversity of the world’s population. It also lacks numerous unique sequences that are found in multiple individuals[2,3,4,5,6,7,8]. This work presents an important first step towards a human reference genome that represents the diversity of human populations.
