Abstract

Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data.

Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally, we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes.

Availability and implementation: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Most efforts to discover genetic variation in populations begin with alignment of high-throughput sequencing (HTS) data to a high-quality reference genome for the organism under study

  • We propose a new assembly graph data structure called the Linked de Bruijn Graph (LdBG)

  • We demonstrate its value by application to variant discovery and characterization of genomic context for drug resistance genes in Klebsiella pneumoniae

Summary

Introduction

Most efforts to discover genetic variation in populations begin with alignment of high-throughput sequencing (HTS) data to a high-quality reference genome for the organism under study. This approach works well for regions with low divergence from the reference haplotype. In 13 isolates of the diploid coccolithophore Emiliania huxleyi, 8–40 Mbp of the approximately 142 Mbp genome were found to be isolate-specific; up to 25% of genes were found to be absent from the reference sequence (Read et al., 2013). In these scenarios, reads may fail to map to the reference, preventing the analyst from inspecting biologically interesting variation. There is a penalty for this approach: long-range information in the read is sacrificed. This is problematic as genomes tend to have many repetitive regions and without context it is often not possible to determine the origin of a random k-mer (Miller et al., 2010; Pevzner, 2004). We consider the possibility of using such structures for regular analysis of human-scale genomes.
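The loss of long-range information can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the McCortex implementation: the function names `build_dbg` and `junction_choices`, the toy reads and the choice of k = 3 are all hypothetical, and the per-read list of junction choices is only a rough stand-in for the link annotations the paper describes.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + 1 + k])
    return graph


def junction_choices(read, graph, k):
    """List (branching k-mer, next base) pairs for the path this read takes.

    Hypothetical simplification: a real link annotation would be stored on
    the graph itself rather than returned per read.
    """
    choices = []
    for i in range(len(read) - k):
        kmer = read[i:i + k]
        if len(graph[kmer]) > 1:  # more than one successor: a junction
            choices.append((kmer, read[i + k]))
    return choices


# Two toy reads that share the repeat "CGTTC" but differ on either side of it.
reads = ["ACGTTCA", "GCGTTCT"]
graph = build_dbg(reads, k=3)

# The k-mer TTC has two successors, so the graph alone cannot tell whether
# a path entering from ACG should leave via TCA or TCT.
print(sorted(graph["TTC"]))                   # ['TCA', 'TCT']
print(junction_choices(reads[0], graph, 3))   # [('TTC', 'A')]
print(junction_choices(reads[1], graph, 3))   # [('TTC', 'T')]
```

In this toy graph the sequence ACGTTCT is a valid walk even though neither read contains it. Recording which branch each read took at the junction is the kind of extra context that would rule such a walk out.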

Definitions and notation
Assembly graphs
String graphs
Other approaches for preserving connectivity
The linked de Bruijn graph
Read-to-graph alignment
Link annotation
Implementation
Multi-coloured linked de Bruijn graphs
Equivalence of LdBG and input string
Correcting errors in reads
Sensitivity to word length
Comparison to other assemblers
Results: applications
Large-variant discovery
Reference-link guided assembly
Scalability
Findings
Discussion