Abstract

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.
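Two of the metrics quoted above, read N50 and the Phred quality score (QV), follow standard definitions that a few lines of Python can make concrete. This is a minimal sketch under those standard definitions, not code from Shasta, MarginPolish or HELEN; the function names `n50` and `identity_to_qv` are hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of read (or contig) lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty positive input


def identity_to_qv(identity: float) -> float:
    """Convert a fractional identity (e.g. 0.999) to a Phred-scaled QV.

    QV = -10 * log10(error rate), where error rate = 1 - identity.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    error_rate = 1.0 - identity
    if error_rate == 0.0:
        return float("inf")  # a perfect match has no finite QV
    return -10.0 * math.log10(error_rate)


# Toy input, not data from the study.
toy_lengths = [10, 20, 30, 40, 50]
toy_n50 = n50(toy_lengths)                      # 40
qv_at_999 = round(identity_to_qv(0.999), 2)     # 30.0
```

Under this definition, 99.9% identity corresponds to an error rate of one in 1,000 bases, which is why the abstract equates it with QV = 30.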

Highlights

  • Reference-based methods such as GATK[1] can infer human variations from short-read sequences, but the results only cover ~90% of the reference human genome assembly[2,3]

  • A total of two nuclease flushes were performed per flow cell, and each flow cell received a total of three sequencing libraries

  • For the HG00733 and CHM13 samples we found that Shasta assemblies polished with MarginPolish and homopolymer encoded long-read error-corrector for Nanopore (HELEN) contained most human protein coding genes, having, respectively, an identified ortholog for 99.23% (152 missing) and 99.11% (175 missing) of these genes


Summary

Introduction

Reference-based methods such as GATK[1] can infer human variations from short-read sequences, but the results only cover ~90% of the reference human genome assembly[2,3]. These methods are accurate with respect to single-nucleotide variants and short insertions and deletions (indels) in this mappable portion of the reference genome[4]. In addition to increasingly being used in reference-guided methods[2,14,15,16], long-read sequences can generate highly contiguous de novo genome assemblies[17]. Nanopore sequencing, as commercialized by Oxford Nanopore Technologies (ONT), is useful for de novo genome assembly because it can produce high yields of very long reads of 100+ kilobases (kb)[18]. Using a combination of nanopore and proximity-ligation (HiC) sequencing[9] together with our toolkit, we report improvements in human genome sequencing coupled with reduced time, labor and cost.

