Abstract

Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.

Highlights

  • The development of long-read sequencing technologies has revolutionized the genome assembly landscape

  • The DNA molecule found in almost every cell of a living organism can be represented as a sequence of four different nucleotides, or bases, denoted by the letters A, C, G, and T

  • Current sequencing technologies require breaking the DNA molecule into short fragments, sequencing them to determine the corresponding sequence of letters (producing “reads”), and then assembly, which recovers the DNA sequence from the reads

Summary

Introduction

The development of long-read sequencing technologies has revolutionized the genome assembly landscape. It is possible, for example, to get reads that are up to a million bases long from Oxford Nanopore’s MinION and PromethION instruments [1]. These "ultralong" reads are extremely helpful for assembling genomic regions filled with long, complex repeats. For many genomes that have already been sequenced, existing assemblies already capture most of the non-repetitive sequence, obviating the need to generate additional short-read data. For these genome assemblies, adding long-read data might provide a fast, cost-effective way to improve their contiguity.
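To make the idea of improving contiguity with long reads concrete, the following toy sketch joins two contigs using a single read that spans the gap between them. It is an illustration only and is not SAMBA's algorithm: the function name `fill_gap`, the `anchor` parameter, and the example sequences are all hypothetical, and the sketch assumes error-free reads on the same strand as the contigs, whereas real long reads have relatively high error rates and require alignment rather than exact matching.

```python
from typing import Optional


def fill_gap(left_contig: str, right_contig: str, read: str,
             anchor: int = 8) -> Optional[str]:
    """Join two contigs with a read that spans the gap between them.

    Toy assumptions (not SAMBA's method): the read is error-free, lies on
    the same strand as both contigs, and contains the last `anchor` bases
    of the left contig and the first `anchor` bases of the right contig.
    Returns the merged sequence, or None if the read does not span the gap.
    """
    left_anchor = left_contig[-anchor:]
    right_anchor = right_contig[:anchor]

    # Locate the end of the left contig inside the read.
    i = read.find(left_anchor)
    if i == -1:
        return None
    gap_start = i + anchor

    # Locate the start of the right contig downstream of the left anchor.
    j = read.find(right_anchor, gap_start)
    if j == -1:
        return None

    # Everything between the two anchors is the sequence missing from the gap.
    gap_sequence = read[gap_start:j]
    return left_contig + gap_sequence + right_contig


# Hypothetical example data: two contigs separated by a 12-base gap.
left = "ACGTTGCAAGGCTTAC"
right = "GGATCCTTAGCAGTCA"
gap = "TTTTCCCCAAAA"
spanning_read = left[-10:] + gap + right[:10]

merged = fill_gap(left, right, spanning_read)
print(merged == left + gap + right)
```

In this sketch two contigs become one longer contig once the read supplies the missing bases, which is the sense in which filling spanned gaps increases contiguity. A read that reaches only one of the two contigs yields `None`, since it cannot be used to bridge the gap.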

