Parallelized short read assembly of large genomes using de Bruijn graphs

Yongchao Liu, Douglas L Maskell, Bertil Schmidt

doi:10.1186/1471-2105-12-354


Open Access

https://doi.org/10.1186/1471-2105-12-354


Abstract

Background: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads.

Results: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources.

Conclusions: Developing parallel assemblers for large genomes has been garnering significant research efforts due to the explosive size growth of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
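The NG50 figures quoted in the abstract can be read with the standard definition in mind: NG50 is the length of the shortest contig (or scaffold) in the smallest set of longest sequences that together cover at least half of the genome size. The sketch below illustrates that definition only; the function name `ng50` and its parameters are hypothetical and are not taken from PASHA or the paper's evaluation scripts.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    contig_lengths: iterable of contig (or scaffold) lengths.
    genome_size: reference or estimated genome size (an input assumed
                 to be known; N50 would use the assembly length instead).
    """
    half = genome_size / 2
    covered = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    # The assembly covers less than half of the genome.
    return 0
```

Unlike N50, this metric depends on the genome size rather than on the total assembly length, which makes values comparable across assemblers that emit different total amounts of sequence.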

Highlights

Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers
We present PASHA, a parallelized short read assembler for large genomes based on de Bruijn graphs
PASHA is a parallelized algorithm for large genome assembly, which overcomes the memory and execution speed constraints by using hybrid computing architectures consisting of shared-memory multi-core Central Processing Unit (CPU) and distributed-memory compute clusters

Summary

Introduction

The emergence and widespread adoption of massively parallel next-generation sequencing technologies has given rise to an explosive increase in DNA sequencing throughput at a substantially lower unit cost of data than conventional Sanger capillary-based technologies, and has promoted the recent development of de novo short read assemblers. These technologies introduce new challenges to the assembly of large genomes due to two factors: (i) short read length and (ii) high throughput.

SOAPdenovo employs a de Bruijn graph data structure similar to that of Velvet, but uses a multi-threaded design to parallelize compute-intensive portions on shared-memory architectures. Besides the algorithms that use directed de Bruijn graphs, YAGA [12] employs a bi-directed string graph, represented by a set of edges, and produces contigs through path walking using a variation of the classic parallel list ranking problem. To assemble the E. coli dataset (see the Results and Discussion section), YAGA's execution time (496 seconds) using 256 CPUs of a Blue Gene/L system is longer than that of PASHA (325 seconds) on a single CPU core.
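For readers unfamiliar with the data structure named above, the following is a minimal, generic sketch of a directed de Bruijn graph built from reads. It is a textbook construction under simplifying assumptions (single strand, no reverse complements, no error correction, no parallelism) and does not represent the implementation in PASHA, Velvet or SOAPdenovo. The function name `build_de_bruijn` is hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a directed de Bruijn graph from a list of reads.

    Nodes are (k-1)-mers; each k-mer found in a read contributes one
    edge from its (k-1)-length prefix to its (k-1)-length suffix.
    Reverse complements are ignored (a simplification for illustration).
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Contigs correspond to unambiguous paths in such a graph, which is why memory for the k-mer set and the cost of traversing it dominate when assembling large genomes.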


Journal: BMC Bioinformatics	Publication Date: Aug 25, 2011
Citations: 87	License type: CC BY 2.0


