Data structures and compression algorithms for high-throughput sequencing technologies

Kenny Daily, Pierre Baldi, Paul Rigor, Xiaohui Xie, Scott Christley

doi:10.1186/1471-2105-11-514

Abstract

Background: High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage and sharing of HTS data.

Results: We develop data structures and compression algorithms for HTS data. A processing stage maps short sequences to a reference genome or a large table of sequences. Then the integers representing the short sequences' absolute or relative addresses, their lengths, and the substitutions they may contain are compressed and stored using various entropy coding algorithms, including both old and new fixed codes (e.g. Golomb, Elias Gamma, MOV) and variable codes (e.g. Huffman). The general methodology is illustrated and applied to several HTS data sets. Results show that the information contained in HTS files can be compressed by a factor of 10 or more, depending on the statistical properties of the data sets and various other choices and constraints. Our algorithms fare well against general purpose compression programs such as gzip, bzip2 and 7zip; timing results show that our algorithms are consistently faster than the best general purpose compression programs.

Conclusions: It is not likely that exactly one encoding strategy will be optimal for all types of HTS data. Different experimental conditions are going to generate various data distributions whereby one encoding strategy can be more effective than another. We have implemented some of our encoding algorithms in the software package GenCompress, which is available upon request from the authors. With the advent of HTS technology and a growing number of new experimental protocols for using the technology, sequence databases are expected to continue rising in size. The methodology we have proposed is general, and these advanced compression techniques should allow researchers to manage and share their HTS data in a more timely fashion.
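To make the idea of coding read addresses as integers concrete, the following is a minimal Python sketch of one combination named in the abstract: relative addresses (gaps between sorted mapped positions) written with the Elias gamma code. It is an illustration under stated assumptions, not the GenCompress implementation. The function names, the use of bit strings instead of packed bytes, and the choice to add 1 to each gap (so that reads mapping to the same position remain encodable) are all assumptions made here for readability.

```python
from typing import List


def elias_gamma_encode(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary digits without the '0b' prefix
    # Prefix with (length - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of Elias gamma codes back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_positions(positions: List[int]) -> str:
    """Encode non-negative mapped positions as gamma-coded gaps.

    Assumption for this sketch: positions are sorted first, and each gap is
    shifted by +1 because the gamma code cannot represent zero.
    """
    bits = []
    previous = 0
    for position in sorted(positions):
        bits.append(elias_gamma_encode(position - previous + 1))
        previous = position
    return "".join(bits)


def decode_positions(bits: str) -> List[int]:
    """Invert encode_positions, returning the sorted positions."""
    positions = []
    previous = 0
    for code in elias_gamma_decode(bits):
        previous += code - 1  # undo the +1 shift applied when encoding
        positions.append(previous)
    return positions
```

For example, `elias_gamma_encode(5)` gives `"00101"`, and `decode_positions(encode_positions([100, 105, 105, 230]))` returns the original sorted list. Small gaps get short codes, which is why relative addresses can compress better than absolute ones when reads are densely mapped; for other gap distributions a Golomb or Huffman code may be the better fit, consistent with the abstract's point that no single encoding is optimal for all data sets.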

Highlights

High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments
We have presented a set of data structures and compression algorithms for high-throughput sequencing data
Any arbitrary genome sequence can be used for mapping the reads, but it is likely that the genome which most closely matches the organism for the read data will provide the best performance

Summary

Introduction

High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. Over the past four decades, sequencing technologies have been one of the major driving forces in the life sciences, producing, for instance, the full genome sequences of thousands of viruses and bacteria, and dozens of eukaryotic organisms, from yeast to man [1]. This trend is being accentuated by modern HTS technologies: several human genomes were recently produced [2,3,4,5], and a project to sequence 1,000 human genomes in the next few years is under way [6].

Journal: BMC Bioinformatics	Publication Date: Oct 14, 2010
Citations: 72	License type: cc-by
