Abstract

Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, unavailability of highly parallel and scalable de novo assembly algorithms have hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, this process of sequentially iterating from small to large k-values slows down the process of assembly. In this paper, we propose ScalaDBG, which metamorphoses this sequential process, building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, by both scaling up on all cores of a node, and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, but with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genome in our dataset).

Highlights

  • For a fixed iteration set of k-values, starting from k = kmin to k = kmax, the final graphs obtained by ScalaDBG-SP and IDBA-UD are identical

  • For ScalaDBG, the number of nodes were equal to the number of k-values while IDBA-UD can only run on a single node

  • Existing scaffolding techniques can be applied to output contigs that are obtained from ScalaDBG to get the final assembly

Read more

Summary

Motivation for ScalaDBG

While leveraging multiple k-values during the assembly improves its quality, the time taken to perform the assembly process increases significantly. The graph-construction step (Stage 2), consisting of building an accumulated DBG by iterating over several different k-values, is the bottleneck in the assembly workflow To address this concern, we propose ScalaDBG, a new parallel assembly algorithm that parallelizes Stage 2 of the assembly workflow, the iterative DBG construction process with multiple k-values. While each MPI process can independently perform graph construction on different nodes in a cluster, Open-MP threads can exploit all cores on a single node. We break the dependency in DBG creation for multiple k-values—from a purely serial process to one where the most time-consuming part (the DBG creation for individual k-values) is parallelized This innovation can be applied out-of-the-box to most DBG-based assemblers. The software package is available at https://github.com/purdue-dcsl/Scaladbg

Related Work
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.