Abstract

BackgroundHash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.ResultsWe propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.ConclusionsOur BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.Electronic supplementary materialThe online version of this article (doi:10.1186/s13015-016-0069-5) contains supplementary material, which is available to authorized users.

Highlights

  • Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size

  • A hash table represents a genome sequence as multiple lists of genomic positions, one list for each possible k-mer, or oligomer of some preselected length k

  • Hash tables are used by various programs, including blast [2], patternhunter [3], shrimp [4], blat [5], NextGenMap [6], and gmap [7], to identify short oligomer matches between a read and the genome

Read more

Summary

Results

We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values

Background
Methods
Experimental setup
Discussion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.