Abstract

We introduce a compressed suffix array representation that, on a text T of length n over an alphabet of size $$\sigma $$, can be built in O(n) deterministic time, within $$O(n\log \sigma )$$ bits of working space, and counts the number of occurrences of any pattern P in T in time $$O(|P| + \log \log _w \sigma )$$ on a RAM machine of $$w=\Omega (\log n)$$-bit words. This time is almost optimal for large alphabets ($$\log \sigma =\Theta (\log n)$$), and it outperforms all the other compressed indexes that can be built in linear deterministic time, as well as some others. The only faster indexes can be built in linear time only in expectation, or require $$\Theta (n\log n)$$ bits. For smaller alphabets, where $$\log \sigma = o(\log n)$$, we show how, by using space proportional to a compressed representation of the text, we can build in linear time an index that counts in time $$O(|P|/\log _\sigma n + \log _\sigma ^\epsilon n)$$ for any constant $$\epsilon >0$$. This is almost RAM-optimal in the typical case where $$w=\Theta (\log n)$$.

Highlights

  • The string indexing problem consists in preprocessing a string T so that, later, we can efficiently find occurrences of patterns P in T

  • The most popular solutions to this problem are suffix trees [29] and suffix arrays [21]. Both can be built in O(n) deterministic time on a text T of length n over an alphabet of size σ, and the best variants can count the number of times a string P appears in T in time O(|P|), and even in time O(|P|/log_σ n) in the word-RAM model if P is given packed into |P|/log_σ n words [26]

  • We have shown how to build, in O(n) deterministic time and using O(n log σ) bits of working space, a compressed self-index for a text T of length n over an alphabet of size σ that searches for patterns P in time O(|P| + log log_w σ), on a w-bit word RAM machine
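As a point of reference for what "counting" means here, the following is a minimal sketch of counting with a plain (uncompressed) suffix array. It is not the paper's index: the construction below sorts suffixes naively rather than in O(n) time, the search costs O(|P| log n) rather than O(|P| + log log_w σ), and the function names `build_suffix_array` and `count_occurrences` are hypothetical, chosen only for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the starting positions of all suffixes in lexicographic order.

    Naive construction by sorting suffix strings; this is an illustrative
    assumption, NOT the linear-time deterministic construction of the paper.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of `pattern` in `text` with two binary searches.

    All suffixes that start with `pattern` form one contiguous range of the
    suffix array, so the count is the width of that range.
    """
    m = len(pattern)
    if m == 0:
        return 0  # convention chosen for this sketch

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


text = "abracadabra"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "abra"))
```

The suffix array here stores n integers of log n bits each, which is the Θ(n log n)-bit cost that compressed indexes aim to avoid.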

Summary

Introduction

The string indexing problem consists in preprocessing a string T so that, later, we can efficiently find occurrences of patterns P in T. Fischer and Gawrychowski [14] introduced the wexponential search trees, which yield dynamic suffix trees with counting time O(|P| + log log σ). All these structures can be built in linear deterministic time, but require Θ(n log n) bits of space, which challenges their practicality when handling large text collections. Our new self-index, with O(|P| + log log_w σ) counting time, linear-time deterministic construction, and nH_k(T) + o(n log σ) bits of space, dominates all the compressed indexes with linear-time deterministic construction [1, 6], as well as some uncompressed ones [14] (to be fair, we do not cover the case log σ = O(log w), as in this case the previous work [6, Thm. 7] already obtains our result). The only respect in which some of the dominated indexes outperform ours is that they may use o(n(H_k(T)+1)) [6, Thm. 10] or o(n) [6, Thm. 7] bits of redundancy, instead of our o(n log σ) bits.
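To make the notion of counting on a self-index concrete, here is a minimal sketch of backward search over the Burrows-Wheeler transform, the standard FM-index counting procedure. It is a generic textbook version, not the structure proposed in the paper: the BWT is built by naive suffix sorting, `rank` is computed by scanning the string, and the names `bwt_from_text` and `backward_count` as well as the `"\0"` sentinel are assumptions made for this example. The sketch takes one rank-based step per pattern symbol, so the total counting time depends directly on the cost of each rank query.

```python
from collections import Counter


def bwt_from_text(text: str, sentinel: str = "\0") -> str:
    """Burrows-Wheeler transform of text + sentinel.

    Assumes `sentinel` does not occur in `text` and sorts before every
    symbol of it. Built by naive suffix sorting (illustration only).
    """
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for suffix i is the symbol preceding it (cyclically).
    return "".join(s[i - 1] for i in sa)


def backward_count(bwt: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern via backward search."""
    # C[c] = number of symbols in the text strictly smaller than c.
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    lo, hi = 0, len(bwt)  # current suffix array range [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        # rank(c, i) = occurrences of c in bwt[:i]; naive linear scan here.
        lo = C[c] + bwt.count(c, 0, lo)
        hi = C[c] + bwt.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = bwt_from_text("abracadabra")
print(backward_count(bwt, "abra"))
```

In a real compressed index the linear-scan `bwt.count` call would be replaced by a succinct rank structure, and the BWT itself would be stored in compressed form.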

Preliminaries
Rank and Select Queries
Suffix Array and Suffix Tree
Compressed Suffix Array and Tree
Burrows-Wheeler Transform and FM-index
Small Interval Rank Queries
Compressed Index
Pattern Search
Sequences and Related Structures
Structures Du
Conclusions