Integrating long-range connectivity information into de Bruijn graphs.

Isaac Turner, Gil McVean, Kiran V Garimella, Zamin Iqbal, Bonnie Berger

doi:10.1093/bioinformatics/bty157

Abstract

Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data.

Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes.

Availability and implementation: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex.

Supplementary information: Supplementary data are available at Bioinformatics online.
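To make the idea of annotating a de Bruijn graph with connectivity information concrete, here is a minimal Python sketch. It is not the McCortex implementation and does not reproduce the paper's link format: the function names (`build_dbg`, `junction_choices`, `walk`), the toy read and the choice of k = 3 are all assumptions made for illustration. The sketch records which base a read takes at each branching k-mer, then uses that record to walk the graph and recover the read.

```python
def build_dbg(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            successors = graph.setdefault(read[i:i + k], set())
            if i + k < len(read):
                successors.add(read[i + 1:i + k + 1])
    return graph


def junction_choices(graph, read, k):
    """Record the base a read takes at every branching k-mer it crosses.

    This is a simplified stand-in for a link annotation (an assumption,
    not the paper's exact definition).
    """
    choices = []
    for i in range(len(read) - k):
        if len(graph[read[i:i + k]]) > 1:
            choices.append(read[i + k])
    return choices


def walk(graph, start, choices, max_steps=10_000):
    """Follow the graph from `start`, consuming one choice per branch."""
    seq, kmer, pending = start, start, list(choices)
    for _ in range(max_steps):  # guard against cycles without branches
        successors = graph[kmer]
        if not successors:
            break  # dead end: nothing follows this k-mer
        if len(successors) == 1:
            nxt = next(iter(successors))
        elif pending:
            nxt = kmer[1:] + pending.pop(0)
        else:
            break  # ambiguous branch and no recorded choice left
        seq += nxt[-1]
        kmer = nxt
    return seq


read = "TACGGACGC"  # made-up read; the 3-mer ACG occurs twice
graph = build_dbg([read], k=3)
links = junction_choices(graph, read, k=3)

print(walk(graph, "TAC", []))     # plain graph stops at the branch: TACG
print(walk(graph, "TAC", links))  # with recorded choices: TACGGACGC
```

In this toy case the plain graph cannot tell which way to leave the repeated k-mer `ACG`, while the recorded choices resolve both visits and the full read is recovered.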

Highlights

Most efforts to discover genetic variation in populations begin with alignment of high-throughput sequencing (HTS) data to a high-quality reference genome for the organism under study.
We propose a new assembly graph data structure called the Linked de Bruijn Graph (LdBG).
We demonstrate the value of the LdBG by application to variant discovery and characterization of genomic context for drug resistance genes in Klebsiella pneumoniae.

Summary

Introduction

Most efforts to discover genetic variation in populations begin with alignment of high-throughput sequencing (HTS) data to a high-quality reference genome for the organism under study. This approach works well for regions with low divergence from the reference haplotype. In 13 isolates of the diploid coccolithophore Emiliania huxleyi, 8–40 Mbp of the approximately 142 Mbp genome were found to be isolate-specific, and up to 25% of genes were found to be absent from the reference sequence (Read et al., 2013). In these scenarios, reads may fail to map to the reference, preventing the analyst from inspecting biologically interesting variation. Decomposing reads into k-mers, as a de Bruijn graph does, carries a penalty: long-range information in the read is sacrificed. This is problematic because genomes tend to have many repetitive regions, and without context it is often not possible to determine the origin of a random k-mer (Miller et al., 2010; Pevzner, 2004). We consider the possibility of using such structures for regular analysis of human-scale genomes.
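The point about repeats can be illustrated with a short Python sketch. The sequence below is made up for the example and `repeated_kmers` is a hypothetical helper, not part of any published tool. It lists every k-mer that occurs at more than one position, which is exactly the case where a k-mer on its own cannot be traced back to a single origin.

```python
from collections import defaultdict


def repeated_kmers(genome, k):
    """Return {k-mer: [start positions]} for k-mers seen more than once."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


genome = "TTGACCGTAAGACCGTCC"  # made-up sequence; GACCGT occurs twice

print(repeated_kmers(genome, 4))  # {'GACC': [2, 10], 'ACCG': [3, 11], 'CCGT': [4, 12]}
print(repeated_kmers(genome, 7))  # {}
```

With k = 4 three k-mers fall entirely inside the 6 bp repeat and each has two possible origins. With k = 7 every k-mer spans unique flanking sequence and the ambiguity disappears in this toy case; in real genomes, repeats longer than any practical k remain.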

Definitions and notation

Assembly graphs

String graphs

Other approaches for preserving connectivity

The linked de Bruijn graph

Read-to-graph alignment

Link annotation

Implementation

Multi-coloured linked de Bruijn graphs

Equivalence of LdBG and input string

Correcting errors in reads

Sensitivity to word length

Comparison to other assemblers

Results: applications

Large-variant discovery

Reference-link guided assembly

Scalability

Findings

Discussion

Journal: Computer applications in the biosciences : CABIOS	Publication Date: Mar 15, 2018
Citations: 60	License type: CC BY 4.0

Similar Papers

Efficient de novo assembly of large genomes using compressed data structures
Jared T Simpson ... Richard Durbin
Genome research | VOL. 22
07 Dec 2011

Bit-parallel sequence-to-graph alignment.
Mikko Rautiainen ... Inanc Birol
Computer applications in the biosciences : CABIOS | VOL. 35
09 Mar 2019

Spaced seed data structures
Inanc Birol ... Hamid Mohamadi
01 Nov 2014

Integration of string and de Bruijn graphs for genome assembly.
Yao-Ting Huang ... Chen-Fu Liao
Computer applications in the biosciences : CABIOS | VOL. 32
10 Jan 2016
