Distributed RMI-DBG model: Scalable iterative de Bruijn graph algorithm for short read genome assembly problem

Zeinab Zare Hosseini, Shekoufeh Kolahdouz Rahimi, Esmaeil Forouzan, Ahmad Baraani

doi:10.1016/j.eswa.2023.120859


Open Access

https://doi.org/10.1016/j.eswa.2023.120859


Abstract

Genome assembly is the computational process of merging short DNA fragments into longer sequences called contigs. The rapid growth of high-throughput sequencing technologies and the large volumes of data they produce have shifted genome assembly from shared-memory to distributed-memory systems in recent years. Among existing assembly algorithms, the iterative de Bruijn graph (DBG) is a leading approach for assembling short reads. By exploiting the advantages of every value of k between kmin and kmax, this approach generates high-quality assemblies. However, assembly slows down considerably, especially on larger data sets. RMI-DBG is an agile iterative de Bruijn graph algorithm that combines the computational efficiency of de Bruijn graph methods with the flexibility of overlap-based algorithms. In this paper, we propose a distributed iterative DBG model based on RMI-DBG, named DRMI-DBG. It addresses the problem of parallelizing de Bruijn graph construction and processing on distributed-memory systems at each iteration of the algorithm. DRMI-DBG is a scalable iterative DBG framework that runs on a Hadoop cluster and uses Spark (a batch processing engine) and Giraph (a distributed system for processing large graphs). Experiments on a variety of real data sets show that DRMI-DBG runs up to 4.8 times faster than the RMI-DBG algorithm and the IDBA-UD assembler, with comparable or better assembly quality. For further evaluation, the performance of the proposed model is also compared with ScalaDBG, the state-of-the-art distributed assembler based on the multiple-k-value strategy.
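
To make the idea of building a de Bruijn graph for each k between kmin and kmax concrete, the following is a minimal single-machine sketch in Python. It is not the DRMI-DBG implementation and uses neither Spark nor Giraph. The function names `build_dbg` and `edge_counts` and the toy reads are assumptions made for illustration only. The sketch also simplifies the iterative scheme: it rebuilds the graph from the raw reads at every k, whereas an iterative assembler would typically carry information forward from one iteration to the next.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Build a toy de Bruijn graph for one value of k.

    Nodes are (k-1)-mers; each distinct k-mer contributes one directed
    edge from its (k-1)-length prefix to its (k-1)-length suffix.
    """
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def edge_counts(reads, k_min, k_max):
    """Return {k: number of edges} for every k in [k_min, k_max].

    Simplification (assumption): each graph is built from the raw reads
    only, with no contigs passed between iterations.
    """
    counts = {}
    for k in range(k_min, k_max + 1):
        graph = build_dbg(reads, k)
        counts[k] = sum(len(successors) for successors in graph.values())
    return counts


if __name__ == "__main__":
    toy_reads = ["ACGTACGA", "GTACGATT"]  # hypothetical reads
    print(edge_counts(toy_reads, 3, 4))
```

In a distributed setting, the k-mer extraction loop is the part that would naturally map onto a batch engine and the graph traversal onto a graph processing system, but how DRMI-DBG divides that work is described in the paper itself, not in this sketch.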

Full Text