A space and time-efficient index for the compacted colored de Bruijn graph

Fatemeh Almodaresi, Avi Srivastava, Rob Patro, Hirak Sarkar

doi:10.1093/bioinformatics/bty292

Open Access

https://doi.org/10.1093/bioinformatics/bty292

Journal: Bioinformatics	Publication Date: Jun 27, 2018
Citations: 76	License type: CC BY-NC 4.0

Affiliation: Stony Brook University

Abstract

Motivation: Indexing reference sequences for search, both individual genomes and collections of genomes, is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly repetitive sequence regions. Yet much less attention has been given to how best to index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grow large.

Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.

Availability and implementation: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

Supplementary information: Supplementary data are available at Bioinformatics online.
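
To make the kind of query described in the abstract concrete, the following is a toy Python sketch of a k-mer lookup over a compacted colored de Bruijn graph. It is an illustration under stated assumptions, not the pufferfish implementation: the class `ToyUnitigIndex`, its fields, and the example unitigs are all hypothetical, a plain `dict` stands in for the minimum perfect hash, a binary search over unitig start offsets stands in for a succinct rank query, and reverse complements are ignored. The unitigs are assumed to be precomputed so that each k-mer occurs in exactly one unitig.

```python
from bisect import bisect_right


class ToyUnitigIndex:
    """Toy k-mer index over precomputed unitigs (illustrative only)."""

    def __init__(self, unitigs, unitig_refs, k):
        # unitigs: list of unitig strings, each of length >= k
        # unitig_refs: per-unitig list of (reference_name, position) pairs
        self.k = k
        self.unitig_refs = unitig_refs
        self.seq = "".join(unitigs)  # concatenated unitig sequence
        self.starts = []             # start offset of each unitig in seq
        self.kmer_pos = {}           # dict stands in for a minimum perfect hash
        offset = 0
        for u in unitigs:
            self.starts.append(offset)
            # Enumerate k-mers per unitig so none spans a unitig boundary.
            for i in range(len(u) - k + 1):
                self.kmer_pos[u[i:i + k]] = offset + i
            offset += len(u)

    def lookup(self, kmer):
        """Return (unitig_id, offset_in_unitig, references), or None if absent."""
        pos = self.kmer_pos.get(kmer)
        if pos is None:
            return None
        # Binary search stands in for a rank query on a boundary bit vector.
        uid = bisect_right(self.starts, pos) - 1
        return uid, pos - self.starts[uid], self.unitig_refs[uid]


# Hypothetical input: ref1 = "ACGTT" and ref2 = "ACGTA" with k = 3 share the
# path ACG -> CGT and then diverge, giving three unitigs.
index = ToyUnitigIndex(
    unitigs=["ACGT", "GTT", "GTA"],
    unitig_refs=[
        [("ref1", 0), ("ref2", 0)],  # shared unitig: present in both references
        [("ref1", 2)],
        [("ref2", 2)],
    ],
    k=3,
)

shared = index.lookup("CGT")   # k-mer in the shared unitig
private = index.lookup("GTA")  # k-mer found only in ref2
missing = index.lookup("AAA")  # k-mer absent from the index
```

In this sketch, the per-unitig reference lists play the role of the "colors": one lookup resolves a k-mer to its unitig and, through it, to every reference that contains it. Storing reference information per unitig rather than per k-mer is what keeps the table small when references share long stretches of sequence.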
