Abstract

Unlike genome assemblers built on de Bruijn graphs, assemblers based on string graphs retain the information in the sequence data losslessly. However, despite the advantages a string graph framework offers for repeat detection and for maintaining read coherence, the high computational cost of constructing a string graph limits its usability for genome assembly. Although several algorithms for string graph construction have been proposed over the last decade, efficiency remains a challenge because of the large volume of sequence data generated by Next-Generation Sequencing technologies. In this paper, we present SOF, a novel, linear-time, alphabet-size-independent algorithm that uses the properties of irreducible and transitive edges to efficiently construct a string graph from an overlap graph. Experimental results show that SOF is at least 2.3 times faster than the string graph construction algorithm in SGA (one of the most popular string graph-based assemblers) while maintaining almost the same memory footprint. Moreover, because SOF is implemented as a subprogram in the SGA assembly pipeline, users have easy access to SGA's preprocessing and postprocessing steps for genome assembly. Implementation: https://github.com/iqbalmorshed/sof
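To make the distinction between transitive and irreducible edges concrete, the following is a minimal sketch of reducing a toy overlap graph to a string graph. It is not the SOF algorithm and not SGA's implementation: it uses naive quadratic-time overlap detection and a simplified, purely graph-based test for transitivity. The function names, the three reads, and the minimum overlap length of 4 are all assumptions chosen for illustration.

```python
from itertools import permutations


def longest_overlap(a, b, min_len):
    """Length of the longest proper suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len):
    """Naive all-pairs overlap graph: graph[u][v] = overlap length of u -> v."""
    graph = {r: {} for r in reads}
    for a, b in permutations(reads, 2):
        k = longest_overlap(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph


def remove_transitive_edges(graph):
    """Keep only irreducible edges (simplified test, for illustration only).

    An edge u -> w is treated as transitive when some v exists with
    u -> v and v -> w, so the path through v already implies u -> w.
    """
    reduced = {}
    for u, successors in graph.items():
        transitive = {w for v in successors for w in graph[v] if w in successors}
        reduced[u] = {w: k for w, k in successors.items() if w not in transitive}
    return reduced


# Hypothetical reads: three length-7 windows, each shifted by one base.
reads = ["GATTACA", "ATTACAG", "TTACAGC"]
graph = overlap_graph(reads, min_len=4)
string_graph = remove_transitive_edges(graph)
print(string_graph)
```

In this toy input, the edge from `GATTACA` to `TTACAGC` is transitive because the same relationship is implied by the path through `ATTACAG`, so only the two irreducible edges remain. The sketch only shows what the reduction produces; the linear-time, alphabet-size-independent construction described in the abstract is a different and far more efficient procedure.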
