Space-efficient construction of Lempel–Ziv compressed text indexes

Diego Arroyuelo,Gonzalo Navarro

doi:10.1016/j.ic.2011.03.001

Diego Arroyuelo, Gonzalo Navarro

Open Access

https://doi.org/10.1016/j.ic.2011.03.001

Copy DOI

Journal: Information and Computation	Publication Date: Apr 7, 2011
Citations: 85	License type: elsevier-specific: oa user license

Affiliation: University of Chile

Abstract

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Indexing text using the Ziv–Lempel trie, Journal of Discrete Algorithms (JDA) 2 (1) (2004) 87–114] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T [ 1 . . u ] over an alphabet of size σ, the LZ-index requires 4 | LZ | ( 1 + o ( 1 ) ) bits of space, where | LZ | is the size of the LZ78-compression of T. This can be bounded by | LZ | = u H k ( T ) + o ( u log σ ) , where H k ( T ) is the k-th order empirical entropy of T, for any k = o ( log σ u ) . The LZ-index is built in O ( u log σ ) time, yet requiring O ( u log u ) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0–1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in O ( u ( log σ + log log u ) ) time and requiring 4 | LZ | ( 1 + o ( 1 ) ) bits of main memory, that is, asymptotically the same space of the final index. We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times | LZ | ( 1 + o ( 1 ) ) bits, and show that these can also be built using asymptotically the same space of the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in O ( u ( log σ + log log u ) ) time, and within | LZ | ( 1 + o ( 1 ) ) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.

Full Text