Colored De Bruijn Graph Research Articles

The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto (the strongest competitor in terms of index space vs. query time trade-off), Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
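The "1 + o(1) bits per unitig" claim in the abstract above can be illustrated with a small sketch. It assumes that unitigs have already been sorted so that unitigs sharing a color are contiguous, and that color ids are dense integers assigned in that order. The function names (`mark_color_boundaries`, `rank1`, `color_of_unitig`) are hypothetical and are not taken from Fulgor; the linear-time `rank1` is a stand-in for the constant-time succinct rank structure that would account for the o(1) term.

```python
def mark_color_boundaries(unitig_color_ids):
    """Return one bit per unitig: 1 if the unitig is the last of its color run.

    Assumes unitigs are sorted so that equal color ids are contiguous.
    """
    n = len(unitig_color_ids)
    bits = []
    for i, cid in enumerate(unitig_color_ids):
        last_of_run = (i + 1 == n) or (unitig_color_ids[i + 1] != cid)
        bits.append(1 if last_of_run else 0)
    return bits


def rank1(bits, i):
    """Number of 1s in bits[0:i]. Linear scan here; a succinct rank
    structure would answer this in constant time with o(n) extra bits."""
    return sum(bits[:i])


def color_of_unitig(bits, color_sets, unitig_id):
    """The color id of a unitig is the number of color runs that end before it."""
    return color_sets[rank1(bits, unitig_id)]


# Hypothetical toy data: six unitigs sorted by color, three distinct colors.
unitig_color_ids = [0, 0, 0, 1, 2, 2]
color_sets = [[0, 3], [1], [0, 1, 2]]  # color id -> sorted reference ids
bits = mark_color_boundaries(unitig_color_ids)
print(color_of_unitig(bits, color_sets, 4))
```

Only `bits` and `color_sets` need to be kept after construction; the per-unitig color ids are recovered from the bit vector alone, which is why the mapping costs roughly one bit per unitig in this sketch.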

Motivation: Indexing reference sequences for search (both individual genomes and collections of genomes) is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given to how best to index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.

Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.

Availability and implementation: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

Supplementary information: Supplementary data are available at Bioinformatics online.
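The hashing-based lookup mentioned in the Results above can be sketched as follows. This is a minimal illustration under stated assumptions, not pufferfish's implementation: a Python `dict` stands in for the minimum perfect hash function, non-key k-mers are sent to an arbitrary slot to mimic how a minimum perfect hash behaves on keys it was not built for, and only forward-strand k-mers are indexed (no canonical k-mers, no sampling). All function names are hypothetical.

```python
def build_kmer_index(unitigs, k):
    """Assign each distinct k-mer a slot and record where it occurs.

    slot_of is a stand-in for a minimum perfect hash function;
    positions[slot] is the (unitig id, offset) of that k-mer.
    """
    slot_of = {}
    positions = []
    for uid, seq in enumerate(unitigs):
        for off in range(len(seq) - k + 1):
            kmer = seq[off:off + k]
            if kmer not in slot_of:
                slot_of[kmer] = len(positions)
                positions.append((uid, off))
    return slot_of, positions


def lookup(kmer, unitigs, slot_of, positions):
    """Return (unitig id, offset) for an indexed k-mer, or None.

    A minimum perfect hash returns some slot even for unseen keys, so the
    answer is verified against the stored unitig sequence.
    """
    if not positions:
        return None
    slot = slot_of.get(kmer, hash(kmer) % len(positions))
    uid, off = positions[slot]
    if unitigs[uid][off:off + len(kmer)] == kmer:
        return uid, off
    return None


# Hypothetical toy unitigs with k = 3.
unitigs = ["ACGTAC", "GTTTA"]
slot_of, positions = build_kmer_index(unitigs, 3)
print(lookup("GTA", unitigs, slot_of, positions))
```

The verification step is what lets the hash function itself store no keys: the unitig sequences double as the key set, which is one reason such an index can be smaller than a conventional hash table.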

Articles published on Colored De Bruijn Graph

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

Where the patterns are: repetition-aware compression for colored de Bruijn graphs ⋆.

Graphite: painting genomes using a colored de Bruijn graph.

Pangenome-spanning epistasis and coselection analysis via de Bruijn graphs.

Compression algorithm for colored de Bruijn graphs

Fulgor: a fast and compact k-mer index for large-scale matching and color queries.

MkcDBGAS: a reference-free approach to identify comprehensive alternative splicing events in a transcriptome.

Compression Algorithm for Colored de Bruijn Graphs.

Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT.

Efficient Colored de Bruijn Graph for Indexing Reads.

Sequence-based pangenomic core detection

An incrementally updatable and scalable system for large-scale sequence search using the Bentley-Saxe transformation.

Population-scale detection of non-reference sequence variants using colored de Bruijn graphs.

Detecting high-scoring local alignments in pangenome graphs.

Somatic variant analysis of linked-reads sequencing data with Lancet.

Detection of simple and complex de novo mutations with multiple reference sequences.

Alignment- and reference-free phylogenomics with colored de Bruijn graphs

An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

Building large updatable colored de Bruijn graphs via merging.

A space and time-efficient index for the compacted colored de Bruijn graph
