Compact representation of k-mer de Bruijn graphs for genome read assembly.

Einar Andreas Rødland

doi:10.1186/1471-2105-14-313

Abstract

Background: Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. The memory requirements for storing all k-mers in a lookup table can be demanding, even after removal of read errors, but can be alleviated by using a memory efficient data structure.

Results: The FM-index, which is based on the Burrows–Wheeler transform, is an efficient data structure that provides a searchable index of all substrings from a set of strings. It is used to compactly represent full genomes for mapping reads to a genome: the memory required to store it is of the same order of magnitude as the strings themselves. However, reads from high throughput sequencing mostly have high coverage and so contain the same substrings multiple times from different reads. I here present a modification of the FM-index, which I call the kFM-index, for indexing the set of k-mers from the reads. For DNA sequences, this requires 5 bits of information for each vertex of the corresponding de Bruijn subgraph, i.e. for each different (k−1)-mer, plus some additional overhead, typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index for walking the underlying de Bruijn graph and reproducing the actual k-mers efficiently.

Conclusions: The kFM-index could replace more memory demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads. A Java implementation with additional technical documentation is provided which demonstrates the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/).

Highlights

Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads
Genomes are usually sequenced at high coverage, which means there will frequently be at least 30–50 reads covering the same region of the genome, differing primarily by sequencing errors
A common approach for simplifying the processing of the sequence data is to consider all the k-mers of the reads: i.e. all the k-substrings of the reads if we view them as strings. This set of k-strings is thought of as a subgraph of the de Bruijn graph of order k − 1: i.e. one which has vertices corresponding to all (k − 1)-substrings and edges corresponding to the k-substrings

Summary

Introduction

Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. A common approach for simplifying the processing of the sequence data is to consider all the k-mers of the reads: i.e. all the k-substrings of the reads if we view them as strings. This set of k-strings is thought of as a subgraph of the de Bruijn graph of order k − 1: i.e. one which has vertices corresponding to all (k − 1)-substrings and edges corresponding to the k-substrings. Direct storage of all k-mers in a single list will require k letters per k-mer, i.e. 2k bits of information per k-mer for DNA sequences (2 bits for each of the four possible letters), which can be quite memory consuming when k is large.
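
To make the graph view concrete, the following is a minimal Python sketch, not the paper's Java implementation and not the kFM-index itself. It collects the distinct k-mers of a set of reads, treats each as an edge between its (k − 1)-prefix and its (k − 1)-suffix, and compares the naive cost of 2k bits per k-mer with the per-vertex figure quoted in the abstract (5 bits plus an assumed overhead of 0.5 to 1 bit). The function names and the toy reads are illustrative assumptions.

```python
def kmers(read, k):
    """Yield every k-substring (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def de_bruijn_subgraph(reads, k):
    """Return (edges, graph) for the k-mers of the reads.

    edges: set of distinct k-mers (the edges of the subgraph).
    graph: dict mapping each (k-1)-mer vertex to the set of
           (k-1)-mer vertices it has an outgoing edge to.
    """
    edges = set()
    for read in reads:
        edges.update(kmers(read, k))

    graph = {}
    for kmer in edges:
        source, target = kmer[:-1], kmer[1:]
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # keep vertices with no out-edges
    return edges, graph


def naive_bits(num_kmers, k):
    """Bits needed to list every DNA k-mer directly: 2 bits per letter."""
    return 2 * k * num_kmers


def per_vertex_bits(num_vertices, overhead=1.0):
    """Rough estimate from the abstract's figures: 5 bits per vertex
    plus an overhead (assumed here to be 0.5 to 1 bit) per vertex."""
    return (5 + overhead) * num_vertices


if __name__ == "__main__":
    # Two overlapping toy reads (hypothetical data).
    reads = ["ACGTAC", "CGTACG"]
    edges, graph = de_bruijn_subgraph(reads, k=3)
    print(sorted(edges))
    print({v: sorted(ws) for v, ws in sorted(graph.items())})
    print(naive_bits(len(edges), 3), per_vertex_bits(len(graph)))
```

Because overlapping reads contribute the same k-mers repeatedly, the set of distinct edges is much smaller than the total number of k-substrings in the reads. For realistic values of k the naive cost grows linearly in k, whereas the per-vertex estimate does not depend on k.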



Journal: BMC Bioinformatics
Publication Date: Oct 23, 2013
Citations: 39
License type: CC BY 2.0


