Fast Compressed Self-indexes with Deterministic Linear-Time Construction

J. Ian Munro, Gonzalo Navarro, Yakov Nekrich

doi:10.1007/s00453-019-00637-x

Open Access

https://doi.org/10.1007/s00453-019-00637-x

Abstract

We introduce a compressed suffix array representation that, on a text T of length n over an alphabet of size $$\sigma $$, can be built in O(n) deterministic time, within $$O(n\log \sigma )$$ bits of working space, and counts the number of occurrences of any pattern P in T in time $$O(|P| + \log \log _w \sigma )$$ on a RAM machine of $$w=\Omega (\log n)$$-bit words. This time is almost optimal for large alphabets ($$\log \sigma =\Theta (\log n)$$), and it outperforms all the other compressed indexes that can be built in linear deterministic time, as well as some others. The only faster indexes can be built in linear time only in expectation, or require $$\Theta (n\log n)$$ bits. For smaller alphabets, where $$\log \sigma = o(\log n)$$, we show how, by using space proportional to a compressed representation of the text, we can build in linear time an index that counts in time $$O(|P|/\log _\sigma n + \log _\sigma ^\epsilon n)$$ for any constant $$\epsilon >0$$. This is almost RAM-optimal in the typical case where $$w=\Theta (\log n)$$.
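
As a rough worked illustration of the small-alphabet bound, take $$n = 2^{32}$$, $$\sigma = 4$$, $$w = 32$$ and $$\epsilon = 1/2$$. These concrete values are assumptions chosen for arithmetic convenience, they do not come from the paper, and the constants hidden by the O-notation are ignored:

$$\log _\sigma n = \frac{\log _2 2^{32}}{\log _2 4} = \frac{32}{2} = 16, \qquad \frac{|P|}{\log _\sigma n} = \frac{64}{16} = 4 \ \text{ for } |P| = 64, \qquad \log _\sigma ^{\epsilon } n = 16^{1/2} = 4.$$

Under these assumptions, a pattern of 64 two-bit symbols packs into 4 machine words, so the bound is on the order of 4 + 4 = 8 steps, compared with the 64 steps of a symbol-by-symbol scan.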

Highlights

The string indexing problem consists in preprocessing a string T so that, later, we can efficiently find occurrences of patterns P in T.
The most popular solutions to this problem are suffix trees [29] and suffix arrays [21]. Both can be built in O(n) deterministic time on a text T of length n over an alphabet of size σ, and the best variants can count the number of times a string P appears in T in time O(|P|), and even in time O(|P| / log_σ n) in the word-RAM model if P is given packed into |P| / log_σ n words [26].
We have shown how to build, in O(n) deterministic time and using O(n log σ) bits of working space, a compressed self-index for a text T of length n over an alphabet of size σ that searches for patterns P in time O(|P| + log log_w σ), on a w-bit word RAM machine.
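
To make the counting problem concrete, here is a minimal sketch of counting with a plain (uncompressed) suffix array. It is not the construction from the paper: the function names `build_suffix_array` and `count_occurrences` are hypothetical, the suffix array is built by naive sorting rather than in linear time, and the query uses two binary searches, which costs O(|P| log n) rather than the O(|P|) achieved by the best variants.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffixes; this is NOT linear time and is
    used here only to keep the sketch short.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern in text using suffix array sa.

    All suffixes that start with pattern form one contiguous range of sa,
    so two binary searches over the length-m prefixes locate its ends.
    """
    m = len(pattern)

    # First position whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First position whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return lo - start
```

For example, with `text = "abracadabra"` and `sa = build_suffix_array(text)`, calling `count_occurrences(text, sa, "abra")` counts the two occurrences starting at positions 0 and 7.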

Summary

Introduction

The string indexing problem consists in preprocessing a string T so that, later, we can efficiently find occurrences of patterns P in T. Fischer and Gawrychowski [14] introduced the wexponential search trees, which yield dynamic suffix trees with counting time O(|P| + log log σ). All these structures can be built in linear deterministic time, but require Θ(n log n) bits of space, which challenges their practicality when handling large text collections. Our new self-index, with O(|P| + log log_w σ) counting time, linear-time deterministic construction, and nH_k(T) + o(n log σ) bits of space, dominates all the compressed indexes with linear-time deterministic construction [1, 6], as well as some uncompressed ones [14] (to be fair, we do not cover the case log σ = O(log w), as in this case the previous work [6, Thm. 7] already obtains our result). The only aspect in which some of the dominated indexes outperform ours is that they may use o(n(H_k(T) + 1)) [6, Thm. 10] or o(n) [6, Thm. 7] bits of redundancy, instead of our o(n log σ) bits.
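
For readers unfamiliar with how a self-index can count without storing the text, the following is a minimal sketch of the textbook backward search over the Burrows-Wheeler transform. It illustrates the general FM-index idea only and is not the data structure of this paper: the names `bwt_from_text` and `backward_count` are hypothetical, the sentinel `"\0"` is assumed not to occur in the text, the transform is built by naive sorting, and rank is answered by a linear scan where a real index would use a compressed rank structure.

```python
from collections import Counter


def bwt_from_text(text):
    """Return the Burrows-Wheeler transform of text plus a sentinel.

    Assumes "\0" does not occur in text; it sorts below every other symbol.
    Built by naive suffix sorting, which is not linear time.
    """
    s = text + "\0"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol is the one preceding each sorted suffix (cyclically).
    return "".join(s[i - 1] for i in sa)


def backward_count(bwt, pattern):
    """Count occurrences of a non-empty pattern using only the BWT string."""
    # C[c] = number of symbols in bwt that are strictly smaller than c.
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a naive linear scan for illustration.
        return bwt.count(c, 0, i)

    # Half-open range [lo, hi) of sorted suffixes prefixed by the part of
    # the pattern processed so far; the pattern is read right to left.
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern symbol costs two rank queries, so the counting time is |P| times the cost of a rank query; that per-symbol rank cost is the kind of term a faster index aims to reduce.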

Preliminaries

Rank and Select Queries

Suffix Array and Suffix Tree

Compressed Suffix Array and Tree

Burrows-Wheeler Transform and FM-index

Small Interval Rank Queries

Compressed Index

Pattern Search

Sequences and Related Structures

Structures Du

Conclusions

Journal: Algorithmica	Publication Date: Oct 22, 2019
Citations: 2	License type: cc-by
