Abstract

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Indexing text using the Ziv–Lempel trie, Journal of Discrete Algorithms (JDA) 2 (1) (2004) 87–114] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T [ 1 . . u ] over an alphabet of size σ, the LZ-index requires 4 | LZ | ( 1 + o ( 1 ) ) bits of space, where | LZ | is the size of the LZ78-compression of T. This can be bounded by | LZ | = u H k ( T ) + o ( u log σ ) , where H k ( T ) is the k-th order empirical entropy of T, for any k = o ( log σ u ) . The LZ-index is built in O ( u log σ ) time, yet requiring O ( u log u ) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0–1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in O ( u ( log σ + log log u ) ) time and requiring 4 | LZ | ( 1 + o ( 1 ) ) bits of main memory, that is, asymptotically the same space of the final index. We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times | LZ | ( 1 + o ( 1 ) ) bits, and show that these can also be built using asymptotically the same space of the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in O ( u ( log σ + log log u ) ) time, and within | LZ | ( 1 + o ( 1 ) ) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.