Abstract

Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata.

Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain.

Availability and implementation: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • The revolution of high-throughput DNA sequencing has created an unprecedented need for efficient representations of large amounts of biological sequences

  • We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters

  • Probabilistic column compression with Bloom filters: for cases where a lossy compression scheme with moderate loss of accuracy will suffice in place of fully lossless compression, we explore a probabilistic compression of the annotation matrix as a near-exact compromise


Summary

Introduction

The revolution of high-throughput DNA sequencing has created an unprecedented need for efficient representations of large amounts of biological sequences. Note that we impose no additional restrictions on the graph coloring (i.e., neighboring edges are allowed to have the same coloring). Another important application of colored de Bruijn graphs is building an efficient representation and index of multiple genomes, forming a so-called pan-genome store (Myers et al., 2017). The first group contains approaches such as Bloom Filter Tries (Holley et al., 2016) for pan-genome representation, deBGR (Pandey et al., 2017a), which encodes a weighted de Bruijn graph, and Split Sequence Bloom Trees (Solomon and Kingsford, 2017), which index short-read datasets based on a hierarchically structured set of Bloom filters. We further reduce the storage requirements of the individual filters by maintaining weak requirements on their respective false-positive rates, which are subsequently corrected for using neighborhood information in the graph. Although both proposed techniques for color compression take advantage of the underlying sequence graph, they impose no restrictions on its topology.
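
To make the Bloom filter idea more concrete, the following is a minimal Python sketch, not the authors' C++ implementation. It keeps one independent Bloom filter per color and, as one plausible reading of "corrected for using neighborhood information", intersects the answers for consecutive k-mers on a path. The class names `BloomFilter` and `BloomAnnotation`, the method `colors_on_path`, the filter sizes, and the assumption that all k-mers on the queried path share the same true coloring are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative helper, not the authors' code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))


class BloomAnnotation:
    """One independently stored Bloom filter per color (hypothetical layout)."""

    def __init__(self, num_bits=1 << 12, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.filters = {}  # color label -> BloomFilter

    def annotate(self, kmer, color):
        if color not in self.filters:
            self.filters[color] = BloomFilter(self.num_bits, self.num_hashes)
        self.filters[color].add(kmer)

    def colors(self, kmer):
        # May report false-positive colors, but never misses a color that was added.
        return {c for c, f in self.filters.items() if kmer in f}

    def colors_on_path(self, kmers):
        # Assumption: every k-mer on this path carries the same true coloring,
        # so intersecting the per-k-mer answers can only remove false positives.
        result = self.colors(kmers[0])
        for kmer in kmers[1:]:
            result &= self.colors(kmer)
        return result


ann = BloomAnnotation()
path = ["ACG", "CGT", "GTA"]  # consecutive 3-mers of the toy sequence ACGTA
for kmer in path:
    ann.annotate(kmer, "sample_A")
ann.annotate("TTT", "sample_B")

print(ann.colors("ACG"))
print(ann.colors_on_path(path))
```

Because a Bloom filter has no false negatives, the intersection along the path still contains every true color, while a spurious color survives only if it is a false positive for every k-mer on the path. This is why, under the stated assumption, individual filters can be kept small despite a weak false-positive rate.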

Approach
Preliminaries and notation
Graph representation
Graph coloring compression
Evaluation and applications
Graph topology affects compression ratios
Properties of compression methods
Wavelet tries and Bloom filters improve on state-of-the-art compression ratios
Discussion