Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Rayan Chikhi, Guillaume Rizk

doi:10.1186/1748-7188-8-22

Abstract

Background: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on an in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥ 30 GB).

Results: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives.

Conclusions: An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.

Highlights

The de Bruijn graph data structure is widely used in next-generation sequencing (NGS)
It was first introduced to perform de novo assembly of DNA sequences [1]. It has recently been used in a wider set of applications: de novo mRNA [2] and metagenome [3] assembly, genomic variants detection [4,5] and de novo alternative splicing calling [6]
We focus on encoding an exact representation of the de Bruijn graph that efficiently implements the following operations: 1. For any node, enumerate its neighbors 2
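
The neighbor-enumeration operation on a Bloom-filter-based encoding can be illustrated with a minimal Python sketch. It is not Minia's implementation. All names (BloomFilter, candidate_neighbors, build, neighbors), the hash function, and the filter parameters are hypothetical. "Critical false positives" are taken here to mean k-mers that the Bloom filter wrongly accepts and that are neighbors of a true node. Reverse complements are ignored, and the true node set is held in memory during construction, which a space-efficient implementation would avoid.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str):
        # One salted BLAKE2b hash per hash function (an arbitrary choice).
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def candidate_neighbors(kmer: str):
    """Yield the 8 k-mers overlapping `kmer` by k-1 characters (may repeat)."""
    for c in "ACGT":
        yield kmer[1:] + c   # possible successor
        yield c + kmer[:-1]  # possible predecessor


def build(kmers, n_bits: int, n_hashes: int):
    """Return the Bloom filter and the set of critical false positives."""
    true_nodes = set(kmers)  # assumption: kept in memory only while building
    bloom = BloomFilter(n_bits, n_hashes)
    for km in true_nodes:
        bloom.add(km)
    critical_fp = {
        nb
        for km in true_nodes
        for nb in candidate_neighbors(km)
        if nb in bloom and nb not in true_nodes
    }
    return bloom, critical_fp


def neighbors(kmer: str, bloom: BloomFilter, critical_fp: set):
    """Exact neighbors of a true node: accepted by the filter, not a critical FP."""
    return sorted(
        nb for nb in set(candidate_neighbors(kmer))
        if nb in bloom and nb not in critical_fp
    )


# Usage: 3-mers of the read "ACGTAC", with a deliberately tiny filter
# so that false positives are likely.
kmers = ["ACG", "CGT", "GTA", "TAC"]
bloom, critical_fp = build(kmers, n_bits=16, n_hashes=2)
for km in kmers:
    print(km, neighbors(km, bloom, critical_fp))
```

When queries start from true nodes, every false positive that could be reached is by construction in the critical set, so the enumeration is exact even though the filter alone is not.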

Summary

Introduction

The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥ 30 GB). The de Bruijn graph of a set of DNA or RNA sequences is a data structure which plays an increasingly important role in next-generation sequencing applications. It was first introduced to perform de novo assembly of DNA sequences [1]. It has recently been used in a wider set of applications: de novo mRNA [2] and metagenome [3] assembly, genomic variants detection [4,5] and de novo alternative splicing calling [6]. The straightforward encoding of the de Bruijn graph for the human genome (n ≈ 2.4 · 10^9 nodes, k-mer size k = 27) requires 15 GB (n · k/4 bytes, i.e. 2 bits per nucleotide) of memory to store the node sequences alone. Graphs for much larger genomes and metagenomes cannot be constructed on a typical lab cluster, because of the prohibitive memory usage.
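
The 15 GB figure can be checked with a few lines of arithmetic. The sketch below assumes 2 bits per nucleotide (four nucleotides per byte), which is what the n · k/4 formula implies. The function name is hypothetical, and reading "GB" as 2^30 bytes is also an assumption.

```python
def naive_node_storage_bytes(n_nodes: int, k: int) -> float:
    """Bytes needed to store n_nodes k-mers at 2 bits per nucleotide (n * k / 4)."""
    return n_nodes * k / 4


n = 2_400_000_000  # approximate number of nodes, from the text
k = 27             # k-mer size, from the text

total_bytes = naive_node_storage_bytes(n, k)
total_gib = total_bytes / 2**30  # assumption: "GB" read as 2^30 bytes

print(f"{total_bytes:.3e} bytes = {total_gib:.1f} GiB")
# prints "1.620e+10 bytes = 15.1 GiB"
```

This counts the node sequences only. Any structure for edges or lookup comes on top of it.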

Methods

Results

Conclusion