Reference-based compression of short-read sequences using path encoding.

Carl Kingsford, Rob Patro

doi:10.1093/bioinformatics/btv071

Abstract

Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.

Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.

Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.

Contact: carlk@cs.cmu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

The size of short-read sequence collections is often a stumbling block to rapid analysis
Our compression approach is composed of several different encoding techniques that are applied to the input reads as a set
CRAM (Fritz et al., 2011) is designed for compressing BAM files. To adapt it to compress sequences, read files were aligned with Bowtie (Langmead et al., 2009) using --best -q -y --sam to an index built from the same transcriptome as used for path encoding

Summary

Introduction

The size of short-read sequence collections is often a stumbling block to rapid analysis. Compressors that operate on BAM files may increase the raw size of the data, since all the alignment information must be preserved. Another reference-based compressor, fastqz (Bonfield and Mahoney, 2013), attempts to compress sequences directly using its own alignment scheme without first creating a BAM file. The arithmetic coder uses a fixed-length context to select a conditional distribution for the following base. This scheme is efficient but has the drawback that, at the start of each read, there is insufficient context to apply the model. The bit tree scheme for storing sets of short sequences (kmers) is of independent interest, as the need to transmit and store collections of kmers is increasingly common in de Bruijn-graph-based genome assembly, metagenomic classification (Wood and Salzberg, 2014) and other analyses (Patro et al., 2014).
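To make the fixed-length context idea concrete, the following Go sketch keeps per-context base counts and reports the ideal code length, in bits, that an arithmetic coder driven by those counts would approach. It is not the authors' implementation and contains no actual arithmetic coder; the type ContextModel, the method TailBits, the add-one pseudocounts and the restriction to the alphabet A, C, G, T are all assumptions made for illustration. The first k bases of each read are skipped, reflecting the drawback noted above that no context is available at the start of a read.

```go
package main

import (
	"fmt"
	"math"
)

// ContextModel is a hypothetical order-k model: for every k-base context
// it counts how often each of A, C, G, T followed that context.
type ContextModel struct {
	k      int
	counts map[string][4]float64
}

// NewContextModel returns an empty model with context length k.
func NewContextModel(k int) *ContextModel {
	return &ContextModel{k: k, counts: make(map[string][4]float64)}
}

// baseIndex maps a base to 0..3. Assumption: reads contain only A, C, G, T.
func baseIndex(b byte) int {
	switch b {
	case 'A':
		return 0
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	panic(fmt.Sprintf("unsupported base %q", b))
}

// Prob returns the conditional probability of base b after context ctx,
// smoothed with a pseudocount of 1 per base (an illustrative choice).
func (m *ContextModel) Prob(ctx string, b byte) float64 {
	seen := m.counts[ctx] // all zeros for a context not seen before
	total := 0.0
	for _, c := range seen {
		total += c + 1
	}
	return (seen[baseIndex(b)] + 1) / total
}

// Update records that base b followed context ctx.
func (m *ContextModel) Update(ctx string, b byte) {
	c := m.counts[ctx]
	c[baseIndex(b)]++
	m.counts[ctx] = c
}

// TailBits sums -log2 p(base | previous k bases) over the tail of the read,
// updating the model after each base. The first k bases are not costed here.
func (m *ContextModel) TailBits(read string) float64 {
	bits := 0.0
	for i := m.k; i < len(read); i++ {
		ctx := read[i-m.k : i]
		bits -= math.Log2(m.Prob(ctx, read[i]))
		m.Update(ctx, read[i])
	}
	return bits
}

func main() {
	m := NewContextModel(2)
	fmt.Printf("first pass:  %.2f bits\n", m.TailBits("ACGT"))
	fmt.Printf("second pass: %.2f bits\n", m.TailBits("ACGT"))
}
```

Encoding the same read a second time costs fewer bits because the counts for its contexts have been updated, which is the adaptive behaviour a context-dependent coder relies on.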

Overview

The path encoding problem

Encoding the starts of the reads with a bit tree
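
The text here does not give the layout of the bit tree, so the Go sketch below is only one plausible reading: it assumes a 4-ary trie over equal-length kmers in which every internal node is written, level by level, as a 4-bit mask recording which of the children A, C, G and T exist. The function names encodeBitTree and decodeBitTree, the level-order traversal and the one-mask-per-byte storage are assumptions for illustration, not the authors' format.

```go
package main

import "fmt"

const alphabet = "ACGT"

// encodeBitTree emits one 4-bit mask per internal trie node, level by level.
// Assumptions: k >= 1, every kmer has length k and uses only A, C, G, T.
// Each mask is stored in the low 4 bits of a byte for readability.
func encodeBitTree(kmers []string, k int) []uint8 {
	level := [][]string{kmers} // each group holds the kmers below one node
	var masks []uint8
	for depth := 0; depth < k; depth++ {
		var next [][]string
		for _, group := range level {
			var mask uint8
			for i := 0; i < 4; i++ {
				var child []string
				for _, s := range group {
					if s[depth] == alphabet[i] {
						child = append(child, s)
					}
				}
				if len(child) > 0 {
					mask |= 1 << uint(3-i) // bit order: A C G T
					next = append(next, child)
				}
			}
			masks = append(masks, mask)
		}
		level = next
	}
	return masks
}

// decodeBitTree rebuilds the kmer set (in lexicographic order) from the masks.
func decodeBitTree(masks []uint8, k int) []string {
	prefixes := []string{""}
	pos := 0
	for depth := 0; depth < k; depth++ {
		var next []string
		for _, p := range prefixes {
			mask := masks[pos]
			pos++
			for i := 0; i < 4; i++ {
				if mask&(1<<uint(3-i)) != 0 {
					next = append(next, p+alphabet[i:i+1])
				}
			}
		}
		prefixes = next
	}
	return prefixes
}

func main() {
	masks := encodeBitTree([]string{"GGA", "ACT", "ACG"}, 3)
	for _, m := range masks {
		fmt.Printf("%04b ", m)
	}
	fmt.Println()
	fmt.Println(decodeBitTree(masks, 3))
}
```

Because all kmers share the same length k, the decoder knows when it has reached the leaves and needs no end markers. Shared prefixes are written once, so the saving grows with the amount of prefix overlap in the set; for a handful of unrelated kmers the masks can take more space than the raw sequence.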

Arithmetic coding of read tails

Initializing and updating the sequence generative model

Other considerations

Implementation

Comparison with other methods

Path encoding effectively compresses RNA-seq reads

Encoding of the read tails represents the bulk of the compressed file

Priming the statistical model results in improved compression

Encoding and decoding path-encoded files is fast

Discussion


Journal: Computer applications in the biosciences : CABIOS	Publication Date: Feb 2, 2015
Citations: 52	License type: CC BY 4.0
