Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays

Thomas D Wu

doi:10.1186/s13015-016-0068-6

Abstract

Background: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements.

Results: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search.

Conclusions: The combination of exception guide arrays, a discriminating character array, and integrated data storage provides a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20% faster and 40% more space-efficient than an uncompressed ESA.

Electronic supplementary material: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users.
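The bytecoding idea mentioned in the abstract (one byte per integer whenever possible) can be illustrated with a minimal sketch. It assumes non-negative values, uses the byte value 255 as a marker for oversized entries, and finds those entries by binary search in a sorted exception list. The names `bytecode_encode`, `bytecode_get`, and `SENTINEL` are hypothetical, and the exception guide array evaluated in the paper is not reproduced here.

```python
from bisect import bisect_left

SENTINEL = 255  # assumed marker byte meaning "look in the exception table"

def bytecode_encode(values):
    """Store each value in one byte when it fits; otherwise record an exception."""
    packed = bytearray()
    exc_positions = []  # positions of oversized values, in increasing order
    exc_values = []     # the full values found at those positions
    for i, v in enumerate(values):
        if v < SENTINEL:
            packed.append(v)
        else:
            packed.append(SENTINEL)
            exc_positions.append(i)
            exc_values.append(v)
    return packed, exc_positions, exc_values

def bytecode_get(packed, exc_positions, exc_values, i):
    """Return the original value at index i."""
    b = packed[i]
    if b != SENTINEL:
        return b  # common case: a single byte read
    k = bisect_left(exc_positions, i)  # rare case: binary search the exceptions
    return exc_values[k]

lcp = [0, 3, 254, 255, 1000, 7]  # made-up values for illustration
packed, exc_positions, exc_values = bytecode_encode(lcp)
decoded = [bytecode_get(packed, exc_positions, exc_values, i) for i in range(len(lcp))]
```

In this sketch the common case costs one byte read, and only the marked entries pay for a binary search. Speeding up that second step is the kind of cost an exception guide array is meant to address.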

Highlights

Suffix arrays and their variants are used widely for representing genomes in search applications.
A bytecoding scheme for the longest common prefix (LCP) and child arrays was proposed in the original paper on enhanced suffix arrays [14]. We can consider this to be a reference benchmark, and we show the results of this approach for genomes in Table 1 as ESAbyte, which is relatively fast for both counting and locating tasks.
Because the bucket array is a series of successive pointers into the suffix array, it is amenable to differential coding techniques, such as those that we explored for offset arrays in our companion paper.
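To make the last highlight concrete, here is a minimal sketch of differential coding applied to a non-decreasing pointer array. It assumes a checkpoint (full pointer) every `block` entries and stores only the gaps in between; `diff_encode`, `diff_get`, and the block size of 4 are hypothetical choices for illustration, not the scheme used in the paper or its companion. The gaps are kept in a plain Python list here; in a real index they would be the values handed to a bitpacking routine.

```python
def diff_encode(pointers, block=4):
    """Keep every `block`-th pointer in full, plus the gap before each pointer."""
    checkpoints = pointers[::block]
    # gaps[j] = pointers[j] - pointers[j - 1]; gaps[0] is a placeholder
    gaps = [0] + [b - a for a, b in zip(pointers, pointers[1:])]
    return checkpoints, gaps

def diff_get(checkpoints, gaps, i, block=4):
    """Rebuild pointer i from the nearest preceding checkpoint."""
    start = (i // block) * block
    value = checkpoints[i // block]
    for j in range(start + 1, i + 1):
        value += gaps[j]
    return value

buckets = [0, 2, 2, 9, 15, 15, 40, 41, 100]  # made-up bucket pointers
checkpoints, gaps = diff_encode(buckets)
restored = [diff_get(checkpoints, gaps, i) for i in range(len(buckets))]
```

Because successive pointers are close together, the gaps are much smaller than the pointers themselves and need fewer bits each, while the checkpoints bound the work needed to recover any single entry.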

Summary

Introduction

Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements.

High-throughput sequencing [1] makes it critical to accelerate the alignment of query reads to a genome, and influences the design of genomic data structures for fast pattern search. In our companion paper, we show how hash tables for representing genomes can be made faster by introducing novel bitpacking compression techniques, which allow for larger k-mers and higher specificity. Here, we consider the prevailing alternative to hash tables, namely suffix arrays [2] and related variants. Suffix arrays are used in such programs as segemehl [3], last [4], mummer [5], reputer [6], and star [7], and as an initial stage in recent versions of gsnap [8], which employs hash tables for more complex alignments.
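As background for readers new to suffix arrays, the sketch below shows plain (unenhanced) suffix array search: the array lists suffix start positions in sorted order, so all occurrences of a pattern fall in one contiguous interval that two binary searches can find. The construction by direct sorting is only suitable for tiny inputs, and the names `build_suffix_array` and `find_interval` are hypothetical. None of the ESA auxiliary structures discussed in the paper are used.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_interval(text, sa, pattern):
    """Return (start, end) such that sa[start:end] holds every match position."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

genome = "ACGTACGTAC"  # toy sequence for illustration
sa = build_suffix_array(genome)
start, end = find_interval(genome, sa, "ACG")
count = end - start               # counting task
positions = sorted(sa[start:end])  # locating task
```

The interval width answers a counting query without touching the positions, while a locating query reads the suffix array entries inside the interval.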

Journal: Algorithms for Molecular Biology
Publication Date: Apr 23, 2016
Citations: 2
License type: cc-by

Similar Papers

An Efficient Index Data Structure with the Capabilities of Suffix Trees and Suffix Arrays for Alphabets of Non-negligible Size
Dong Kyue Kim ... Heejin Park
01 Jan 2004

Solving All-Pairs Suffix Prefix – Theory and Practice
Maan Haj Rachid ... Qutaibah Malluhi
01 Jan 2015

Linearized Suffix Tree: an Efficient Index Data Structure with the Capabilities of Suffix Trees and Suffix Arrays
Dong Kyue Kim ... Minhwan Kim
Algorithmica | VOL. 52
24 Oct 2007

Replacing suffix trees with enhanced suffix arrays
Mohamed Ibrahim Abouelhoda ... Enno Ohlebusch
Journal of Discrete Algorithms | VOL. 2
13 Feb 2004
