Abstract

Genome assemblies are currently being produced at an impressive rate by consortia and individual laboratories. The low costs and increasing efficiency of sequencing technologies now enable assembling genomes at unprecedented quality and contiguity. However, the difficulty in assembling repeat‐rich and GC‐rich regions (genomic “dark matter”) limits insights into the evolution of genome structure and regulatory networks. Here, we compare the efficiency of currently available sequencing technologies (short/linked/long reads and proximity ligation maps) and combinations thereof in assembling genomic dark matter. By adopting different de novo assembly strategies, we compare individual draft assemblies to a curated multiplatform reference assembly and identify the genomic features that cause gaps within each assembly. We show that a multiplatform assembly implementing long‐read, linked‐read and proximity sequencing technologies performs best at recovering transposable elements, multicopy MHC genes, GC‐rich microchromosomes and the repeat‐rich W chromosome. Telomere‐to‐telomere assemblies are not a reality yet for most organisms, but by leveraging technology choice it is now possible to minimize genome assembly gaps for downstream analysis. We provide a roadmap to tailor sequencing projects for optimized completeness of both the coding and noncoding parts of nonmodel genomes.

Highlights

  • With the advent of next-generation sequencing (NGS) technologies, the field of genomics has grown exponentially, and during the last 10 years the genomes of almost 10,000 species of prokaryotes and eukaryotes have been sequenced

  • Long-read assemblies are more complete than short-read assemblies, but completeness has usually been measured with statistics that are optimized for short reads rather than for long reads

  • Scaffold N50 and BUSCO values do not reflect the full potential and strengths of new sequencing technologies, and we measured completeness focusing on the most difficult-to-assemble genomic regions
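To illustrate why a contiguity statistic such as scaffold N50 can hide missing sequence, the following minimal sketch computes N50 for two toy assemblies. The function name `n50` and the scaffold lengths are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
assembly_a = [80, 70, 50, 40, 30, 20, 10]  # includes several short scaffolds
assembly_b = [80, 70, 50, 40, 30]          # the short scaffolds are missing

print(n50(assembly_a))  # 70
print(n50(assembly_b))  # 70
```

In this toy case both assemblies have the same N50 even though the second lacks 30 units of sequence, which is the kind of loss that a contiguity metric alone does not reveal.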

Summary

| INTRODUCTION

With the advent of next-generation sequencing (NGS) technologies, the field of genomics has grown exponentially, and during the last 10 years the genomes of almost 10,000 species of prokaryotes and eukaryotes have been sequenced (from NCBI Assembly database, O'Leary et al, 2015). Two sequencing strategies have been developed that produce very long reads from single molecules: (a) Pacific Biosciences (PacBio) SMRT sequencing, where DNA polymerases incorporate fluorescently labelled nucleotides and the luminous signals are captured in real time by a camera (Eid et al, 2009); and (b) Oxford Nanopore Technologies, which records the electrical changes caused by the passage of the different nucleotides through voltage-sensitive synthetic pores (Deamer et al, 2016). These new sequencing techniques have already yielded numerous highly contiguous de novo assemblies (Bickhart et al, 2017; Faino et al, 2015; Gordon et al, 2016; Michael et al, 2018; Seo et al, 2016; Weissensteiner et al, 2017; Yoshimura et al, 2019) and helped to improve the completeness of existing ones (Chaisson et al, 2014; Jain et al, 2018), as well as to characterize complex genomic regions like the human Y centromere and MHC gene clusters (Jain et al, 2018; Rhoads & Au, 2015; Sedlazeck et al, 2018; Westbrook et al, 2015). We provide a general roadmap for how to investigate previously hidden genomic features.

| MATERIALS AND METHODS
| DISCUSSION
Findings
| Strengths and limitations of sequencing technologies
| CONCLUSIONS