Abstract

BackgroundOngoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into k-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads.ResultsHere we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only.ConclusionsOur suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at http://www.zbh.uni-hamburg.de/readjoiner.

Highlights

  • Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers

  • Sequence analysis software tools developed only a few years ago are often unable to deal with such large amounts of short reads: This has led to a gap between the

  • The presented methods for constructing the string graph and the subsequent computation of contigs have been implemented in a sequence assembler named Readjoiner, which is part of the GenomeTools software suite [21]

Read more

Summary

Introduction

Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into k-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads. The de novo sequence assembly problem is to reconstruct a target sequence from a set of sequence reads. The classical approach to de novo assembly consists of three phases: overlap, layout and consensus. Suffix-prefix matches among all pairs of sequence reads are computed, and turned into an overlap graph [1]. In the consensus phase the target sequence is reconstructed, by selecting a base for each position. Sequence analysis software tools developed only a few years ago are often unable to deal with such large amounts of short reads: This has led to a gap between the

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.