Abstract

Background: Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value.

Results: We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes.

Conclusions: This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.

Electronic supplementary material: The online version of this article (doi:10.1186/s13742-015-0076-3) contains supplementary material, which is available to authorized users.
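The alignment-free idea named in the Results (pairing k-mers over long intervals of an error-containing read and using them to connect assembled sequences) can be illustrated with a minimal Python sketch. This is not the LINKS implementation: the function names, the parameters `k`, `distance` and `step`, the exact-match lookup and the single-strand handling are all assumptions made for illustration.

```python
from collections import Counter


def kmer_pairs(read, k, distance, step=1):
    """Return (first, second) k-mer pairs whose outer span is `distance` bases.

    Hypothetical helper: `distance` is measured from the start of the first
    k-mer to the end of the second one.
    """
    if k <= 0 or step <= 0 or distance < 2 * k:
        raise ValueError("need k > 0, step > 0 and distance >= 2 * k")
    pairs = []
    for start in range(0, len(read) - distance + 1, step):
        first = read[start:start + k]
        second = read[start + distance - k:start + distance]
        pairs.append((first, second))
    return pairs


def index_contigs(contigs, k):
    """Map each k-mer found in exactly one contig to that contig's name."""
    seen = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen and seen[kmer] != name:
                seen[kmer] = None  # shared between contigs: ambiguous, ignore
            else:
                seen.setdefault(kmer, name)
    return {kmer: name for kmer, name in seen.items() if name is not None}


def count_links(reads, contigs, k, distance, step=1):
    """Count k-mer pairs that connect two different contigs.

    Simplification: only the forward strand is considered and k-mers must
    match a contig exactly; no read alignment or base correction is done.
    """
    index = index_contigs(contigs, k)
    links = Counter()
    for read in reads:
        for first, second in kmer_pairs(read, k, distance, step):
            a, b = index.get(first), index.get(second)
            if a is not None and b is not None and a != b:
                links[tuple(sorted((a, b)))] += 1
    return links


if __name__ == "__main__":
    contigs = {"c1": "AAACCCGGG", "c2": "TTTGAGTCA"}
    reads = ["CCGGGACACTTTGA"]  # toy read spanning the gap between c1 and c2
    print(count_links(reads, contigs, k=4, distance=14))
```

In this toy setting, a pair of contigs that collects many supporting k-mer pairs would be a candidate for joining into a scaffold; how such evidence is thresholded and ordered is left out of the sketch.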

Highlights

  • We demonstrate the broad applicability of the Long Interval Nucleotide K-mer Scaffolder (LINKS) by re-scaffolding a high-quality draft of the 120-Mbp A. thaliana Ler-1 genome [9] with either raw or corrected [10, 18] long sequence reads from Pacific Biosciences (PacBio)

  • We find that the resulting LINKS assemblies are very contiguous, especially when the PacBio reads are corrected (NG50 > 2.5 Mbp). This highlights 1) the utility of LINKS for retrospective scaffolding of draft genomes with new long read sequencing data and 2) that LINKS scaffolding can be complementary to read correction methodologies (Additional file 1: Figure S7)
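The NG50 figure quoted above is a contiguity statistic: the length of the shortest sequence among the longest sequences that together cover at least half of the estimated genome size. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative assumptions, not values from the study.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a set of sequence lengths, or 0 if the
    assembly covers less than half of `genome_size`."""
    target = genome_size / 2
    running_total = 0
    # Walk from the longest sequence to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


if __name__ == "__main__":
    # Toy example: four scaffolds against an assumed genome size of 120.
    print(ng50([50, 30, 20, 10], genome_size=120))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size estimate, so it stays comparable between assemblies of different total length.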

Summary

Introduction

Quick and colleagues [6] publicly released ONT E. coli long reads as part of the MAP. Although their assessment identified some of the shortcomings of the current technology, it highlighted its great potential, including low-cost throughput and kilobase-long reads.
