Abstract

Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful: the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align both contigs to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets and enables performance assessment in regions with much denser, more complex variation than regions covered by previous benchmarks.

Highlights

  • Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads enable us to construct accurate, phased de novo assemblies

  • We develop a local de novo assembly method using whole-genome sequencing (WGS) data from highly accurate long reads that are partitioned into the two haplotypes using ultralong and linked reads

  • Given that the number of observed distinct human leukocyte antigen (HLA) alleles is still increasing, analysis of Major Histocompatibility Complex (MHC) haplotype structures in the whole human population will be difficult without a large number of high-quality reference sequences


Summary

Introduction

Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads enable us to construct accurate, phased de novo assemblies. The short reads used to develop the small variant benchmarks cannot be uniquely mapped to many repetitive regions of the genome, such as segmental duplications, tandem repeats, and mobile elements. These hard-to-map regions include the very challenging but medically important ~5 million base-pair (bp) region of the human genome called the Major Histocompatibility Complex (MHC). The MHC contains a set of human leukocyte antigen (HLA) genes that play crucial roles in autoimmunity and response to infection, including adaptive and innate immunity[5]. The region is exceptionally variable between individuals and very challenging to characterize with conventional methods, because short reads from it are often too different from the reference to map correctly. Short reads have been used to assemble much of the MHC, but the assembly was highly fragmented even for haploid cell lines[10].

