Abstract

Background: Next-generation sequencing technologies have driven an explosive increase in DNA sequencing throughput and have promoted the recent development of de novo short read assemblers. However, existing assemblers require long execution times and a large amount of compute resources to assemble large genomes from large quantities of short reads.

Results: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources.

Conclusions: The development of parallel assemblers for large genomes has attracted significant research effort due to the explosive growth in the size of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.

Highlights

  • We present PASHA, a parallelized short read assembler for large genomes based on de Bruijn graphs

  • PASHA overcomes the memory and execution speed constraints of large genome assembly by using hybrid computing architectures consisting of shared-memory multi-core CPUs and distributed-memory compute clusters

Summary

Introduction

The emergence and widespread adoption of massively parallel next-generation sequencing technologies has given rise to an explosive increase in DNA sequencing throughput at a substantially lower unit cost of data, compared to conventional Sanger capillary-based technologies. These technologies introduce new challenges to the assembly of large genomes due to two factors: (i) short read length and (ii) high throughput. SOAPdenovo employs a de Bruijn graph data structure similar to that of Velvet, but uses a multi-threaded design to parallelize compute-intensive portions on shared-memory architectures. Besides those algorithms that use directed de Bruijn graphs, YAGA [12] employs a bi-directed string graph, represented by a set of edges, and produces contigs through path walking using a variation of the classic parallel list ranking problem. To assemble the E. coli dataset (see the Results and Discussion section), YAGA's execution time (496 seconds) using 256 CPUs of a Blue Gene/L system is longer than that of PASHA (325 seconds) on a single CPU core.
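
For readers unfamiliar with the de Bruijn graph data structure mentioned above, the following is a minimal single-threaded sketch in Python. It is not PASHA's, Velvet's or SOAPdenovo's implementation: the function names `build_de_bruijn` and `extend_contig`, the toy reads and the choice of k = 4 are all assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors and any parallelism.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Hypothetical helper for illustration only.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            successors[left].add(right)
            predecessors[right].add(left)
    return successors, predecessors


def extend_contig(start, successors, predecessors):
    """Walk forward from `start` while the path is unambiguous."""
    contig = start
    node = start
    seen = {start}
    # Continue only while the current node has exactly one outgoing edge.
    while len(successors.get(node, ())) == 1:
        (nxt,) = successors[node]
        # Stop at a merge point (several incoming edges) or on a cycle.
        if len(predecessors.get(nxt, ())) != 1 or nxt in seen:
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads (made up for this example).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
succ, pred = build_de_bruijn(reads, k=4)
contig = extend_contig("ATG", succ, pred)
print(contig)
```

In this toy case the three reads share no repeated 3-mers, so the walk from `ATG` follows a single unbranched path. Real assemblers must additionally handle branches caused by repeats and sequencing errors, which is where most of the algorithmic and computational cost arises.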
