Succinct data structures for assembling large genomes

Thomas C Conway, Andrew J Bromage

doi:10.1093/bioinformatics/btq697

Abstract

Second-generation sequencing technology makes it feasible for many researchers to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide-scale biological variation, but within human biomedicine, it offers a direct way of observing both large-scale structural variation and fine-scale sequence variation. Unfortunately, improvements in the computational feasibility of de novo assembly have not matched the improvements in the gathering of sequence data. This is for two reasons: the inherent computational complexity of the problem and the in-practice memory requirements of tools. In this article, we use entropy-compressed or succinct data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the kinds of structures used by deployed methods. Moreover, because our representation is entropy compressed, in the presence of sequencing errors it has better asymptotic scaling behaviour than conventional approaches. We present results of a proof-of-concept assembly of a human genome performed on a modest commodity server.
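To make the idea of a compact de Bruijn graph representation concrete, here is a minimal sketch. It is not the authors' implementation, and the abstract does not specify their encoding. It assumes the graph is stored as a set of edges, each a (k+1)-mer packed at 2 bits per base. A sorted Python list stands in for a compressed sparse bitmap with a rank operation. The names `EdgeSet`, `rank`, `successors` and `min_bits` are hypothetical.

```python
from bisect import bisect_left
from math import comb, log2

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(kmer):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENC[base]
    return value


class EdgeSet:
    """Edges of a de Bruijn graph kept as sorted encoded (k+1)-mers.

    Assumption: the sorted list is a stand-in for a sparse bitmap of
    4**(k+1) bits; a real succinct structure would compress that bitmap.
    """

    def __init__(self, reads, k):
        self.k = k
        edges = set()
        for read in reads:
            # Every window of k+1 bases is one edge between two k-mer nodes.
            for i in range(len(read) - k):
                edges.add(encode(read[i:i + k + 1]))
        self.edges = sorted(edges)

    def rank(self, code):
        """Number of stored edges whose code is strictly less than `code`."""
        return bisect_left(self.edges, code)

    def successors(self, node):
        """Bases that extend the k-mer `node` along an existing edge."""
        first = encode(node) << 2  # the four candidate edges are contiguous
        lo, hi = self.rank(first), self.rank(first + 4)
        return [BASES[code & 3] for code in self.edges[lo:hi]]

    def min_bits(self):
        """Lower bound log2 C(4**(k+1), n) on bits needed for n edges."""
        return log2(comb(4 ** (self.k + 1), len(self.edges)))


graph = EdgeSet(["ACGTAC", "ACGTT"], k=3)
print(graph.successors("CGT"))
print(round(graph.min_bits(), 2), "bits minimum for", len(graph.edges), "edges")
```

Because the four possible out-edges of a node share a prefix, two rank queries are enough to find them, so no explicit pointers between nodes need to be stored. The `min_bits` figure is the information-theoretic bound that an entropy-compressed representation aims to approach.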

Journal: Bioinformatics
Publication Date: Jan 17, 2011
Citations: 131

