Abstract

de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.

Highlights

  • DNA sequencing allowed previously unobservable phenomena to be studied on an unprecedented scale

  • We note that the query time with Burrows-Wheeler Transform [57] (BWT)-based k-mer indexes is dominated by […]

  • We introduced the concept of simplitigs, a lightweight alternative to unitigs, and demonstrated that simplitigs constitute a compact, efficient, and scalable representation of de Bruijn graphs for various types of genomic datasets

  • We studied applications to bacterial pan-genomics and showed that the utility of simplitigs compared to unitigs grows as more data are available


Summary

Introduction

DNA sequencing has allowed previously unobservable phenomena to be studied on an unprecedented scale. Sequencing capacity has grown faster than computer performance, memory, and available human resources, and huge amounts of sequence data are available. One elegant solution for genomic data representation is de Bruijn graphs. These build on the concept of k-mers, which are substrings of a fixed length k of the genomic strings to be represented, such as sequencing reads, genomes, and transcriptomes. For a given k-mer set, the corresponding de Bruijn graph is a directed graph in which the k-mers are the vertices and overlaps of length k − 1 between pairs of k-mers define the edges. If k is chosen appropriately, de Bruijn graphs capture substantial information about the sequenced molecules, as these correspond to some walks in the graph.
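To make the definition concrete, the following minimal Python sketch builds the de Bruijn graph of a k-mer set exactly as described above: vertices are k-mers, and an edge runs from one k-mer to another when the last k − 1 characters of the first equal the first k − 1 characters of the second. The function names `kmers` and `de_bruijn_graph` and the toy sequence are illustrative assumptions, not part of the paper or of ProphAsm, and the sketch ignores reverse complements, which practical genomic tools usually take into account.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all substrings of length k of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def de_bruijn_graph(kmer_set):
    """Map every k-mer to the sorted list of its successors.

    A k-mer v is a successor of u when the (k-1)-suffix of u
    equals the (k-1)-prefix of v.
    """
    # Index k-mers by their (k-1)-prefix so successors are found by lookup.
    by_prefix = defaultdict(list)
    for kmer in kmer_set:
        by_prefix[kmer[:-1]].append(kmer)
    # Look up each k-mer's (k-1)-suffix; k-mers without successors get [].
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in kmer_set}


# Hypothetical toy input, chosen only for illustration.
graph = de_bruijn_graph(kmers("ACGTACGA", 3))
for vertex in sorted(graph):
    print(vertex, "->", graph[vertex])
```

In this toy example the original string corresponds to a walk through the graph (ACG, CGT, GTA, TAC, ACG, CGA), which illustrates how sequenced molecules appear as walks when k is chosen appropriately.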
