Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jarrod A Chapman, Daniel S Rokhsar, Sirisha Sunkara, Gary P Schroth, Shujun Luo, Isaac Ho

doi:10.1371/journal.pone.0023501


Open Access

https://doi.org/10.1371/journal.pone.0023501


Abstract

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
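The idea of traversing only k-mers with unique extensions can be illustrated with a toy Python sketch. It is not the meraculous implementation: the function names are hypothetical, the `min_count` threshold is an assumed stand-in for the "high quality" criterion, and reverse complements, base quality scores, and the memory-efficient hashing scheme are all left out.

```python
from collections import defaultdict


def extension_tables(reads, k, min_count=2):
    """Map each k-mer to the bases seen at least min_count times after and before it.

    min_count is an illustrative filter; it is an assumption of this sketch,
    not a parameter taken from the paper.
    """
    nxt = defaultdict(lambda: defaultdict(int))
    prv = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if i + k < len(read):          # a base follows this k-mer in the read
                nxt[kmer][read[i + k]] += 1
            if i > 0:                      # a base precedes this k-mer in the read
                prv[kmer][read[i - 1]] += 1

    def keep(table):
        # Drop extensions supported by fewer than min_count observations.
        return {kmer: {b for b, c in bases.items() if c >= min_count}
                for kmer, bases in table.items()}

    return keep(nxt), keep(prv)


def walk_right(seed, nxt, prv):
    """Extend a seed k-mer to the right while every step is unambiguous."""
    contig = seed
    kmer = seed
    seen = {seed}
    while True:
        forward = nxt.get(kmer, set())
        if len(forward) != 1:              # dead end or fork: stop conservatively
            break
        candidate = kmer[1:] + next(iter(forward))
        # The next k-mer must point back uniquely to the current one.
        if prv.get(candidate, set()) != {kmer[0]}:
            break
        if candidate in seen:              # guard against walking around a cycle
            break
        contig += candidate[-1]
        seen.add(candidate)
        kmer = candidate
    return contig


if __name__ == "__main__":
    reads = ["ACGTTGCA", "ACGTTGCA", "ACGTAGCA"]  # the last read has one odd base
    nxt, prv = extension_tables(reads, k=3, min_count=2)
    print(walk_right("ACG", nxt, prv))
```

In this toy input the variant seen only once falls below the threshold, so the walk proceeds past it without a separate correction pass; lowering `min_count` to 1 makes the same position a fork, and the walk stops there.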

Highlights

Parallel sequencing methods introduced over the past few years provide cost-effective, highly redundant sampling of genomes.
These assemblers all take advantage of the deBruijn graph representation of the assembly problem [24], in which reads are decomposed into overlapping words of length k (“k-mers”), where k is a fraction of the read length.
As a test set for meraculous, we report a dataset of three lanes of 75 bp paired-end shotgun for P. stipitis produced using Illumina sequencing-by-synthesis methods, with both short-range (∼280 bp) and medium-range (∼3.2 kbp) pairing data.

Summary

Introduction

Parallel sequencing methods introduced over the past few years provide cost-effective, highly redundant sampling of genomes (reviewed in [1]). While sequencing by synthesis produces substantially shorter reads, it has lower cost per base and higher throughput [3]. Such data has proven useful for re-sequencing variant genomes [4,5,6], since short reads can be readily aligned to a reference, and the error rates are low enough that variation can be detected by consistent discrepancy of the aligned short reads versus the reference. The importance of using a range of paired-end linkages to organize non-repetitive contigs into scaffolds by linking over repetitive regions was presciently emphasized by Weber and Myers [16] in the context of human whole genome shotgun sequencing. This approach became the dominant paradigm for genome sequencing in the last decade. These assemblers all take advantage of the deBruijn graph representation of the assembly problem [24], in which reads are decomposed into overlapping words of length k (“k-mers”), where k is a fraction of the read length.
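The decomposition of reads into overlapping k-mers can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, and the sketch treats reads as single-stranded strings, which ignores the reverse-complement handling a real assembler needs.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping word of length k in a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    # Consecutive k-mers from one read overlap by k - 1 bases; those overlaps
    # are what the edges of the deBruijn graph record.
    print(list(kmers("ACGTAC", 3)))
    print(count_kmers(["ACGTAC", "CGTACG"], 3))
```

A read of length L yields L - k + 1 k-mers, so a 75 bp read with a hypothetical k of 41 contributes 35 of them.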



Journal: PLoS ONE
Publication Date: Aug 18, 2011
Citations: 263
License type: CC0 1.0
