Abstract

Background: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements.

Results: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search.

Conclusions: The combination of exception guide arrays, a discriminating character array, and integrated data storage provides a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20% faster and 40% more space-efficient than an uncompressed ESA.

Electronic supplementary material: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users.
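To make the bytecoding idea concrete, the following is a minimal sketch, not the paper's implementation. It stores each integer in one byte when it fits and diverts larger values to an exception table. The sentinel value 255, the block size, and the function names (`bytecode_encode`, `build_guide`, `bytecode_get`) are assumptions made for illustration. The guide array shown is one plausible reading of an "exception guide array": for each fixed-size block it records how many exceptions precede the block, so a lookup only searches the exceptions inside one block.

```python
import bisect

SENTINEL = 255  # assumed marker byte meaning "value is in the exception table"


def bytecode_encode(values):
    """Store each non-negative integer in one byte when it is below SENTINEL."""
    packed = bytearray()
    exc_positions = []  # sorted positions whose values did not fit in a byte
    exc_values = []     # the full values at those positions
    for i, v in enumerate(values):
        if v < SENTINEL:
            packed.append(v)
        else:
            packed.append(SENTINEL)
            exc_positions.append(i)
            exc_values.append(v)
    return packed, exc_positions, exc_values


def build_guide(exc_positions, n, block):
    """guide[k] = number of exceptions at positions below k * block."""
    nblocks = (n + block - 1) // block
    return [bisect.bisect_left(exc_positions, k * block)
            for k in range(nblocks + 1)]


def bytecode_get(packed, exc_positions, exc_values, guide, block, i):
    """Return the original value at position i."""
    b = packed[i]
    if b != SENTINEL:
        return b  # common case: a single byte read
    k = i // block
    # Search only the exceptions that fall inside block k.
    j = bisect.bisect_left(exc_positions, i, guide[k], guide[k + 1])
    return exc_values[j]


values = [3, 0, 300, 7, 255, 12, 100000, 1]
BLOCK = 4  # hypothetical block size, kept tiny for the demonstration
packed, exc_positions, exc_values = bytecode_encode(values)
guide = build_guide(exc_positions, len(values), BLOCK)
decoded = [bytecode_get(packed, exc_positions, exc_values, guide, BLOCK, i)
           for i in range(len(values))]
```

In this sketch most lookups cost one byte read, and only the rare large values pay for a short search, which is the trade-off that makes bytecoding attractive when most stored integers are small.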

Highlights

  • Suffix arrays and their variants are used widely for representing genomes in search applications

  • A bytecoding scheme for the longest common prefix (LCP) and child arrays was proposed in the original paper on enhanced suffix arrays [14]. We can consider this to be a reference benchmark, and we show the results of this approach for genomes in Table 1 as ESAbyte, which is relatively fast for both counting and locating tasks

  • Because the bucket array is a series of successive pointers into the suffix array, it is amenable to differential coding techniques, such as those that we explored for offset arrays in our companion paper
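As an illustration of why a series of successive pointers suits differential coding, here is a minimal generic sketch. It is not the scheme from the companion paper, whose details are not given here. It keeps one absolute pointer per block and stores only the differences between neighbours inside the block. The block size and the names `delta_encode` and `delta_get` are assumptions.

```python
def delta_encode(pointers, block=4):
    """Keep every block-th pointer in full; store differences elsewhere."""
    samples = pointers[::block]
    deltas = [
        pointers[i] - pointers[i - 1] if i % block else 0  # 0 at block starts (unused)
        for i in range(len(pointers))
    ]
    return samples, deltas


def delta_get(samples, deltas, i, block=4):
    """Recover pointer i from its block sample plus the deltas up to i."""
    k = i // block
    start = k * block
    return samples[k] + sum(deltas[start + 1:i + 1])


# Hypothetical bucket array: non-decreasing pointers into a suffix array.
pointers = [0, 3, 3, 10, 14, 14, 20, 31, 31]
samples, deltas = delta_encode(pointers)
recovered = [delta_get(samples, deltas, i) for i in range(len(pointers))]
```

Because the pointers never decrease, the differences are small non-negative integers that need far fewer bits than the absolute values, and the per-block samples keep random access cheap.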

Introduction

High-throughput sequencing [1] makes it critical to accelerate the alignment of query reads to a genome, and influences the design of genomic data structures for fast pattern search. In our companion paper, we show how hash tables for representing genomes can be made faster by introducing novel bitpacking compression techniques, which allow for larger k-mers and higher specificity. Here, we consider the prevailing alternative to hash tables, namely, suffix arrays [2] and related variants. Suffix arrays are used in such programs as segemehl [3], last [4], mummer [5], reputer [6], and star [7], and as an initial stage in recent versions of gsnap [8], which employs hash tables for more complex alignments.
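For readers unfamiliar with the underlying structure, the sketch below shows a plain suffix array and a binary search over it. It uses a naive construction that is only suitable for short strings and includes none of the enhanced auxiliary arrays discussed in the paper. The function names and the example string are assumptions chosen for illustration.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_interval(text, sa, pattern):
    """Return the half-open range [left, right) of sa entries that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


genome = "GATTACAGATTACA"  # toy stand-in for a genome string
sa = build_suffix_array(genome)
left, right = find_interval(genome, sa, "ATTA")
count = right - left                   # counting task
locations = sorted(sa[left:right])     # locating task
```

Counting needs only the width of the interval, while locating reads the suffix array entries inside it. Every comparison in this search touches both the suffix array and the genome string, which are the memory accesses that the compressed and enhanced structures aim to reduce.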
