Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Roberto Grossi, Jeffrey Scott Vitter

doi:10.1137/s0097539702402354

Abstract

The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m / \lg_{|\Sigma|} n + \lg_{|\Sigma|}^\epsilon n)$ search time in the worst case, for any constant $0 < \epsilon \leq 1$, using at most $\bigl(\epsilon^{-1} + O(1)\bigr) \, n \lg |\Sigma|$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice.
As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30--40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(\mathit{occ} \, \lg_{|\Sigma|}^\epsilon n)$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m / \lg_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.
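
To make the baseline concrete, the following Python sketch builds an ordinary (uncompressed) suffix array and answers pattern queries by binary search. It illustrates the classical structure the abstract compares against, not the compressed index of the paper. The function names `build_suffix_array`, `find_occurrences`, and `space_bits` are illustrative assumptions, and the naive construction is chosen for clarity rather than efficiency. Each binary-search step compares up to $m$ symbols, so this sketch searches in $O(m \lg n)$ time; the $O(m + \lg n)$ bound mentioned above needs extra longest-common-prefix information that is omitted here.

```python
import math


def build_suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    Suffixes are sorted, so their length-m prefixes are sorted as well;
    two binary searches locate the block of suffixes starting with pattern.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


def space_bits(n, sigma):
    """Bit counts for the text (n lg sigma) and a plain suffix array (n lg n).

    Uses ceil(lg x) bits per entry and ignores word alignment.
    """
    text_bits = n * math.ceil(math.log2(sigma))
    index_bits = n * math.ceil(math.log2(n))
    return text_bits, index_bits


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
print(space_bits(10**8, 256))             # (800000000, 2700000000)
```

For $n = 10^8$ symbols over a 256-symbol alphabet, the last line counts $8 \times 10^8$ bits for the text against $2.7 \times 10^9$ bits for the plain suffix array, a ratio close to $\lg_{|\Sigma|} n$. Because the count ignores word alignment, it need not match the 500 MB figure quoted in the abstract.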

Journal: SIAM Journal on Computing	Publication Date: Jan 1, 2005
Citations: 458	License type: other-oa

Similar Papers

Solving All-Pairs Suffix Prefix – Theory and Practice
Maan Haj Rachid ... Qutaibah Malluhi
01 Jan 2015

Indexing huge genome sequences for solving various problems.
K Sadakane ... T Shibuya
Genome Informatics | VOL. 12
01 Jan 2001

Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.
Dominik Kempa ... Tomasz Kociumaka
Proceedings of the ... Annual ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Symposium on Discrete Algorithms | VOL. 2023
01 Jan 2023

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices
Wing-Kai Hon ... Kunihiko Sadakane
SIAM Journal on Computing | VOL. 38
01 Jan 2009
