Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly.

Priti Kumari, Konstantinos Krampis, Vahan Simonyan, Raja Mazumder

doi:10.12688/f1000research.6016.1

Abstract

Background: The transition to Next Generation Sequencing (NGS) technologies has had numerous applications in plant, microbial and human genomics during the past decade. However, NGS trades high read throughput for shorter read length, increasing the difficulty of genome assembly. This research presents a comparison of traditional versus Cloud computing-based genome assembly software, using as examples the Velvet and Contrail assemblers and reads from the genome sequence of the zebrafish (Danio rerio) model organism.

Results: The first phase of the analysis involved a subset of the zebrafish dataset (2x coverage). The best results were obtained using a K-mer size of 65, and Velvet took less time than Contrail to complete the assembly. In the next phase, genome assembly was attempted using the full dataset at 192x read coverage: Velvet failed to complete on a compute server with 256 GB of memory, while Contrail completed but required 240 hours of computation.

Conclusion: This research concludes that the size of the dataset and the available computing hardware should be taken into consideration when deciding which assembler to use. For a relatively small sequencing dataset, such as a microbial or small eukaryotic genome, the Velvet assembler is a good option. For larger datasets, however, Velvet requires large-memory compute servers on the order of 1000 GB or more. Contrail, on the other hand, is implemented using Hadoop, which performs the assembly in parallel across the nodes of a compute cluster. Furthermore, Hadoop clusters can be rented on demand from Cloud computing providers, so Contrail can provide a simple and cost-effective way to assemble genomes from data generated at laboratories that lack the infrastructure or funds to build their own clusters.
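As a rough illustration of why a Hadoop-style design spreads work across cluster nodes, the sketch below counts K-mers with the generic map, shuffle and reduce steps run in a single process. It is a toy example and not Contrail's code; the function names (`map_read`, `shuffle`, `reduce_counts`, `count_kmers`) and the sample reads are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (kmer, 1) pair for every K-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (the K-mer)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each K-mer."""
    return {kmer: sum(values) for kmer, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical reads, chosen only to show a K-mer shared by two reads.
    counts = count_kmers(["ACGTAC", "GTACGG"], 4)
    for kmer in sorted(counts):
        print(kmer, counts[kmer])
```

Each call to `map_read` depends on one read only, so on a real cluster the map step can run on many nodes at once, with the framework handling the grouping by key. In this sketch everything runs locally and in memory.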

Highlights

The transition to Next Generation Sequencing (NGS) technologies has had numerous applications in plant, microbial and human genomics during the past decade
The Sanger technique is based on chain termination by dideoxynucleotide triphosphates during Polymerase Chain Reaction (PCR) elongation reactions
The reason is that at larger K-mer lengths, the de Bruijn Graph (DBG) algorithm connects two K-mers with an edge on the assembly graph only if the K-1 length suffix of the first K-mer is the same as the K-1 length prefix of the second K-mer
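
The edge rule in the last highlight can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by Velvet or Contrail: the helper names (`kmers`, `build_dbg`) and the two sample reads are hypothetical, and real assemblers also handle reverse complements, sequencing errors and coverage, which are omitted here.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all K-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_dbg(reads, k):
    """Build a toy de Bruijn graph as {kmer: [successor kmers]}.

    K-mer a gets an edge to K-mer b only if the (K-1)-length suffix
    of a equals the (K-1)-length prefix of b.
    """
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))

    # Index K-mers by their (K-1)-length prefix for fast lookup.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)

    # node[1:] is the (K-1)-length suffix of the K-mer.
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


def edge_count(graph):
    """Total number of directed edges in the graph."""
    return sum(len(successors) for successors in graph.values())


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG"]  # hypothetical reads for illustration
    for k in (4, 5):
        print(k, edge_count(build_dbg(reads, k)))
```

On these two sample reads, K=4 yields a branch at TACG (it connects to both ACGG and ACGT), while K=5 yields a single unbranched path, because a longer overlap is required before two K-mers are joined.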

Summary

Introduction

The transition to Next Generation Sequencing (NGS) technologies has had numerous applications in plant, microbial and human genomics during the past decade. Genome assembly was attempted using the full dataset at 192x read coverage: Velvet failed to complete on a compute server with 256 GB of memory, while Contrail completed but required 240 hours of computation. Hadoop clusters can be rented on demand from Cloud computing providers, so Contrail can provide a simple and cost-effective way to assemble genomes from data generated at laboratories that lack the infrastructure or funds to build their own clusters. Although the automated Sanger method dominated the industry for almost two decades, with sequencing applications and broad demand for the technology in genome variation studies, comparative genomics, evolution, forensics, and diagnostic and applied therapeutics, it remained limiting due to its high cost and labor-intensive process [5].



Journal: F1000Research
Publication Date: Jan 22, 2015
Citations: 1
License type: CC BY 4.0


