Abstract

Scaffolding errors and incorrect repeat disambiguation during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub or PyPI, the Python Package Index; a tutorial and user documentation are also available.

Highlights

  • De novo assembly is the construction of a long, contiguous genomic sequence from short DNA reads, without using a reference genome

  • Paired end reads with insert sizes of a few hundred bases provide some additional information for repeat disambiguation; even more useful are read pairs with a large insert size of several kilobases, such as the Illumina Nextera mate pairs (Illumina, 2012)

  • We evaluated NxRepair using de novo assemblies of nine bacterial genomes prepared using the SPAdes assembler, showing that of the six genomes containing misassemblies, three could be improved by NxRepair correction; compared with no improvements made by the A5qc module

Read more

Summary

Introduction

De novo assembly is the construction of a long, contiguous genomic sequence from short DNA reads, without using a reference genome. A common method of de novo genome assembly is construction and traversal of a de Bruijn graph (Compeau, Pevzner & Tesler, 2011) formed by combining overlapping short reads. With only single end reads, disambiguating repeat regions, which tangle the de Bruijn graph, remains challenging. Paired end reads with insert sizes of a few hundred bases provide some additional information for repeat disambiguation; even more useful are read pairs with a large insert size of several kilobases, such as the Illumina Nextera mate pairs (Illumina, 2012). Many assemblers incorporate mate pair insert size information into either both the contig assembly and scaffolding processes (Bankevich et al, 2012), or just

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call