SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores.

Jintao Meng, Pavan Balaji, Bingqiang Wang, Shengzhong Feng, Yanjie Wei

doi:10.1186/1471-2105-15-s9-s2

Abstract

Background: There is a widening gap between the throughput of massively parallel sequencing machines and the ability to analyze the resulting sequencing data. Traditional assembly methods require long execution times and large amounts of memory on a single workstation, which limits their use on such massive data.

Results: This paper presents a highly scalable assembler named SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. A mathematical description of the multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence on merging edges, and a highly scalable computational framework for SWAP is developed to automatically perform the parallel computation of all operations. Graph cleaning and contig extension are also included to generate high-quality contigs. Experimental results show that SWAP-Assembler scales up to 2048 cores on the Yanhuang dataset, completing in only 26 minutes, which is better than several other parallel assemblers, such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler generates high-quality contigs with good N50 size and low error rate; in particular, it generated the longest N50 contig sizes for the Fish and Yanhuang datasets.

Conclusions: In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at: https://sourceforge.net/projects/swapassembler

Highlights

There is a widening gap between the throughput of massively parallel sequencing machines and the ability to analyze the resulting sequencing data.
The parallelization is achieved by distributing k-mers across multiple servers to build a distributed de Bruijn graph; error removal and graph reduction are implemented over MPI communication primitives.
By comparing with several state-of-the-art sequential and parallel assemblers, such as Velvet [22], SOAPdenovo [29], PASHA [16], ABySS [14] and Ray [15], we evaluate SWAP-Assembler's scalability and the quality of its contigs in terms of N50, error rate and coverage.
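
The k-mer distribution mentioned in the highlights can be illustrated with a minimal single-process sketch. It is not SWAP-Assembler's implementation: the function names, the use of CRC32 as the hash, and the choice to store canonical k-mers are all assumptions made for illustration. In an MPI program, each partition would live on a separate rank and k-mers would be sent to their owner rather than stored in a local dictionary.

```python
import zlib
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def owner(kmer, num_ranks):
    """Map a k-mer to the rank that stores it (hypothetical CRC32-based hash).

    Hashing the canonical form sends a k-mer and its reverse complement
    to the same rank, which suits a bi-directed graph.
    """
    return zlib.crc32(canonical(kmer).encode("ascii")) % num_ranks


def partition_kmers(reads, k, num_ranks):
    """Group the canonical k-mers of all reads by owning rank."""
    partitions = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = canonical(read[i:i + k])
            partitions[owner(kmer, num_ranks)].add(kmer)
    return partitions
```

Because ownership depends only on the k-mer itself, any rank can compute where a neighboring vertex lives without a lookup table, which is what makes this kind of partitioning attractive for a distributed de Bruijn graph.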

Summary

Results

SWAP-Assembler is a highly scalable and efficient genome assembler based on a multi-step bi-directed graph (MSG). By comparing with several state-of-the-art sequential and parallel assemblers, such as Velvet [22], SOAPdenovo [29], PASHA [16], ABySS [14] and Ray [15], we evaluate SWAP-Assembler's scalability and the quality of its contigs in terms of N50, error rate and coverage. The evaluation shows that our assembler scales up to 2048 cores, which is much better than the other parallel assemblers, and that the contigs generated by SWAP-Assembler are the best in terms of error rate for several small datasets and in terms of N50 size for two larger datasets.
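
For readers unfamiliar with the N50 metric used above, the sketch below computes it under the standard definition: the largest length L such that contigs of length at least L together cover at least half of the total assembled bases. The function name is hypothetical, and the paper's evaluation scripts may differ in details such as minimum contig length filters.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0
```

A larger N50 means that more of the assembly sits in long contigs, which is why it is reported alongside error rate and coverage rather than on its own.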

