Efficient de novo assembly of large genomes using compressed data structures

Jared T Simpson, Richard Durbin

doi:10.1101/gr.126953.111

Abstract

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory-efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides, to our knowledge, the first practical assembler for a mammalian-sized genome on a low-end computing cluster.
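
To make the FM-index idea concrete, the following is a minimal, uncompressed sketch of backward search over a Burrows-Wheeler transform. It is not SGA's implementation: the names `bwt`, `FMIndex`, and `count` are hypothetical, the transform is built by naively sorting rotations, and the occurrence table is stored in full rather than compressed or sampled, so it is only suitable for toy inputs.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


class FMIndex:
    """Illustrative FM-index with a full (unsampled) occurrence table."""

    def __init__(self, text):
        self.bwt = bwt(text)
        counts = Counter(self.bwt)
        # C[c]: number of symbols in the text strictly smaller than c
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # occ[c][i]: number of occurrences of c in bwt[:i]
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in counts:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)

    def count(self, pattern):
        """Count occurrences of pattern using backward search."""
        lo, hi = 0, len(self.bwt)  # half-open suffix-array interval
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("GATTACATTAG")
print(index.count("TTA"))
```

Each step of the loop narrows the interval of suffixes that start with the pattern suffix processed so far, so a query takes time proportional to the pattern length rather than the text length. Overlap detection between reads can be built on the same kind of interval query, although the details of how SGA does this are not given in the abstract.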

Journal: Genome Research	Publication Date: Dec 7, 2011
Citations: 747

Similar Papers

Integrating long-range connectivity information into de Bruijn graphs.
Isaac Turner ... Zamin Iqbal
Bioinformatics | VOL. 34
15 Mar 2018

A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer.
Feng Zhang ... Xiangke Liao
Interdisciplinary sciences, computational life sciences | VOL. 8
24 Sep 2015

FSG: Fast String Graph Construction for De Novo Assembly.
Paola Bonizzoni ... Raffaella Rizzi
Journal of Computational Biology | VOL. 24
17 Jul 2017

Integration of string and de Bruijn graphs for genome assembly.
Yao-Ting Huang ... Chen-Fu Liao
Bioinformatics | VOL. 32
10 Jan 2016
