Abstract

Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.

Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.

Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.

Contact: carlk@cs.cmu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • The size of short-read sequence collections is often a stumbling block to rapid analysis

  • Our compression approach is composed of several different encoding techniques that are applied to the input reads as a set

  • CRAM (Fritz et al, 2011) is designed for compressing BAM files. To adapt it to compress sequences, read files were aligned with Bowtie (Langmead et al, 2009) using --best -q -y --sam to an index built from the same transcriptome as used for path encoding.

Summary

Introduction

The size of short-read sequence collections is often a stumbling block to rapid analysis. Such BAM compressors may increase the raw size of the data since all the alignment information must be preserved. Another reference-based compressor, fastqz (Bonfield and Mahoney, 2013), attempts to compress sequences directly using its own alignment scheme without first creating a BAM file. The arithmetic coder uses a fixed-length context to select a conditional distribution for the following base. This scheme is efficient but has the drawback that at the start of each read, there is insufficient context to apply the model. The bit tree scheme for storing sets of short sequences (kmers) is of independent interest, as the need to transmit and store collections of kmers is increasingly common in de Bruijn-graph-based genome assembly, metagenomic classification (Wood and Salzberg, 2014) and other analyses (Patro et al, 2014).
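
To make the fixed-length-context idea concrete, the following is a minimal Python sketch, not the authors' Go implementation. It keeps adaptive counts of the next base for each context of k preceding bases and sums the ideal arithmetic-coding cost, -log2(p), over each read's tail. The names ContextModel and tail_cost_bits, and the pseudocount of 1 per base, are assumptions made for illustration. The first k bases of every read are skipped, which is exactly the "insufficient context" problem described above: they would have to be stored by a separate mechanism.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class ContextModel:
    """Adaptive order-k model: counts of the next base given the previous k bases."""

    def __init__(self, k):
        self.k = k
        # Assumption: each (context, base) pair starts with a pseudocount of 1
        # so that no base ever receives probability zero.
        self.counts = defaultdict(lambda: {b: 1 for b in BASES})

    def prob(self, context, base):
        c = self.counts[context]
        return c[base] / sum(c.values())

    def update(self, context, base):
        self.counts[context][base] += 1


def tail_cost_bits(reads, k):
    """Ideal coding cost, in bits, of every base that has a full k-base context.

    The first k bases of each read are not costed here because the model
    has no complete context for them.
    """
    model = ContextModel(k)
    total = 0.0
    for read in reads:
        for i in range(k, len(read)):
            context, base = read[i - k:i], read[i]
            total += -math.log2(model.prob(context, base))
            model.update(context, base)
    return total


if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 50
    k = 4
    tail_bases = sum(max(len(r) - k, 0) for r in reads)
    print("model cost (bits):", round(tail_cost_bits(reads, k), 1))
    print("2 bits per base  :", 2 * tail_bases)
```

On highly repetitive input such as this toy example, the model's cost falls far below the flat 2 bits per base, because each context quickly becomes a strong predictor of the next base. A real arithmetic coder would approach this ideal cost to within a small overhead.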

Overview
The path encoding problem
Encoding the starts of the reads with a bit tree
Arithmetic coding of read tails
Initializing and updating the sequence generative model
Other considerations
Implementation
Comparison with other methods
Path encoding effectively compresses RNA-seq reads
Encoding of the read tails represents the bulk of the compressed file
Priming the statistical model results in improved compression
Encoding and decoding path-encoded files is fast
Discussion