Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers

Kanak Mahadik, Somali Chaterji, Saurabh Bagchi, Milind Kulkarni, Christopher Wright

doi:10.1038/s41598-019-51284-9


Open Access

https://doi.org/10.1038/s41598-019-51284-9


Abstract

Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBGs). However, sequentially iterating from small to large k-values slows down assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism by both scaling up on all cores of a node and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembly faster than IDBA-UD, with similar accuracy, on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
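To make the role of the k-value concrete, the following is a minimal sketch of de Bruijn graph construction for a single k. It is a textbook-style illustration, not ScalaDBG's or IDBA-UD's implementation: the function name `build_dbg` and the choice to represent nodes as (k-1)-mers with one edge per distinct k-mer are assumptions made for this example, and error correction, reverse complements, and coverage filtering are all omitted.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Build a toy de Bruijn graph for one k-value.

    Nodes are (k-1)-mers; each distinct k-mer in the reads adds one
    edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    `build_dbg` is a hypothetical helper, not part of any assembler.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


reads = ["ACGTAC"]
dbg_k3 = build_dbg(reads, 3)  # edges: AC->CG, CG->GT, GT->TA, TA->AC
dbg_k4 = build_dbg(reads, 4)  # edges: ACG->CGT, CGT->GTA, GTA->TAC
```

In this toy input, k = 3 closes a cycle (TA leads back to AC) because the 2-mer AC occurs twice in the read, while k = 4 yields a simple path. Graphs built with different k-values can therefore differ in structure, which is why multi-k assemblers build more than one.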

Highlights

For a fixed iteration set of k-values, from k = kmin to k = kmax, the final graphs obtained by ScalaDBG-SP and IDBA-UD are identical.
For ScalaDBG, the number of nodes was equal to the number of k-values, while IDBA-UD can only run on a single node.
Existing scaffolding techniques can be applied to the contigs output by ScalaDBG to obtain the final assembly.

Summary

Motivation for ScalaDBG

While leveraging multiple k-values during assembly improves its quality, the time taken to perform the assembly increases significantly. The graph-construction step (Stage 2), which builds an accumulated DBG by iterating over several different k-values, is the bottleneck in the assembly workflow. To address this concern, we propose ScalaDBG, a new parallel assembly algorithm that parallelizes Stage 2 of the assembly workflow, the iterative DBG construction process with multiple k-values. While each MPI process can independently perform graph construction on different nodes in a cluster, OpenMP threads can exploit all cores on a single node. We break the dependency in DBG creation for multiple k-values, moving from a purely serial process to one where the most time-consuming part (the DBG creation for individual k-values) is parallelized. This innovation can be applied out-of-the-box to most DBG-based assemblers. The software package is available at https://github.com/purdue-dcsl/Scaladbg
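The idea of building the per-k graphs independently and then combining them can be sketched as follows. This is only an illustration under stated assumptions, not the ScalaDBG code: Python threads stand in for the MPI processes and OpenMP threads described above (so the sketch shows the independence of the tasks rather than a real speedup), a set of distinct k-mers stands in for a full DBG, and `patch` is a hypothetical, simplified reading of the patching step in which k-mers drawn from lower-k contigs are added to the higher-k graph.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_set(reads, k):
    """Toy stand-in for per-k DBG construction: the distinct k-mers."""
    return {km for read in reads for km in kmers(read, k)}


def build_all(reads, k_values, workers=None):
    """Build the graph for each k independently, keyed by k.

    No task depends on another task's result, which is the property
    that allows the builds to run concurrently.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {k: pool.submit(build_kmer_set, reads, k) for k in k_values}
        return {k: fut.result() for k, fut in futures.items()}


def patch(high_kmers, contigs, k_high):
    """Hypothetical patch step (an assumption of this sketch):
    add the k_high-mers found in lower-k contigs to the higher-k set."""
    patched = set(high_kmers)
    for contig in contigs:
        patched.update(kmers(contig, k_high))
    return patched


reads = ["ACGTAC", "GTACGG"]
graphs = build_all(reads, [3, 4])
# A contig assumed to come from the k=3 graph, used to patch the k=4 graph.
patched_k4 = patch(graphs[4], ["CGGT"], 4)
```

Here the contig `CGGT` contributes a 4-mer that no single read contains, so the patched k = 4 set gains one entry. In a real deployment each `build_kmer_set` call would correspond to graph construction on a separate cluster node.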



Journal: Scientific Reports
Publication Date: Oct 16, 2019
Citations: 9
License type: open-access


