Abstract

Scientists increasingly want to assemble large genomes, metagenomes, and large numbers of individual genomes. Parallel genome assembly is vital for processing datasets of this size. Among parallel genome assemblers, those based on de Bruijn graphs are the most popular. However, the size of a de Bruijn graph is determined by the number of distinct kmers used in the algorithm, so redundant kmers in a genome dataset do not contribute to the graph size. The scalability of a genome assembler is therefore influenced directly by the number of distinct kmers in the dataset, that is, by the de Bruijn graph size, rather than by the input dataset size. To evaluate the assembly of large genomes, we artificially created 16 datasets totaling 4 terabytes from the human reference genome. The human reference genome was first mutated at a 5% mutation rate and then passed to the genome sequencing data simulator ART. The number of distinct kmers in the simulated data increases linearly as more datasets are combined. We then evaluated all five time-consuming steps of SWAP-Assembler 2.0 (SWAP2) using these 16 simulated datasets. In contrast to our previous experiment on the 1000 human dataset, which had a fixed de Bruijn graph size, this weak-scaling test shows that SWAP2 scales well from 1024 cores using one dataset to 16,384 cores using all 16 datasets. The percentage of time spent in each of the five steps of SWAP2 stays fixed, and the total time usage also remains constant. The results show that graph simplification accounts for almost 75% of the total time usage, which makes it the target of further optimization in future work.
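
The relationship between input size and distinct kmers described above can be illustrated with a minimal sketch. It is not taken from SWAP2: the function name `distinct_kmers`, the toy reads, and the value of `k` are all assumptions chosen for illustration, and the sketch ignores reverse complements, which real assemblers typically account for.

```python
def distinct_kmers(reads, k):
    """Return the set of distinct k-mers (length-k substrings) across all reads."""
    kmers = set()
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


# Hypothetical toy reads; real sequencing reads are far longer and more numerous.
reads = ["ACGTACGTGA", "CGTACGTGAT"]
k = 4

base = distinct_kmers(reads, k)

# Tripling the input with identical reads adds data but no new k-mers,
# so the de Bruijn graph built from it would not grow.
redundant = distinct_kmers(reads * 3, k)

# A read carrying a single substitution introduces new k-mers,
# which is why mutated copies of a genome enlarge the graph.
mutated_read = "ACGTTCGTGA"
with_variant = distinct_kmers(reads + [mutated_read], k)

print(len(base), len(redundant), len(with_variant))
```

In this sketch the redundant input yields the same kmer set as the original reads, while the mutated read increases the count. This mirrors why the simulated datasets are built from mutated genomes rather than repeated copies of the same data.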
