Indexing compressed text

Giovanni Manzini, Paolo Ferragina

doi:10.1145/1082036.1082039

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.

Our first compressed data structure retrieves the occ occurrences of a pattern P[1, p] within a text T[1, n] in O(p + occ log^{1+ε} n) time for any chosen ε, 0 < ε < 1. This data structure uses at most 5n H_k(T) + o(n) bits of storage, where H_k(T) is the k-th order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array.

Our second compressed data structure achieves O(p + occ) query time using O(n H_k(T) log^ε n) + o(n) bits of storage for any chosen ε, 0 < ε < 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.
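The relationship between suffix arrays and the Burrows-Wheeler Transform mentioned in the abstract can be illustrated with a small, uncompressed sketch. The code below is not the data structure of the paper: it stores the suffix array and the occurrence table explicitly, so it makes no attempt at the stated space bounds. All function names (`suffix_array`, `build_index`, `backward_search`, `locate`) and the choice of `"\0"` as end-of-text sentinel are assumptions made for this illustration. It only shows how a pattern can be matched right to left over the transformed text, one step per pattern character.

```python
from collections import Counter


def suffix_array(text):
    # Naive construction by sorting suffixes; adequate for a small illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_index(text, sentinel="\0"):
    # The sentinel is assumed to be absent from the text and smaller than
    # every other character.
    assert sentinel not in text
    t = text + sentinel
    sa = suffix_array(t)
    # BWT: the character that precedes each suffix in sorted order.
    # For the suffix starting at 0, t[-1] is the sentinel.
    bwt = "".join(t[i - 1] for i in sa)

    # C[c] = number of characters in t that are strictly smaller than c.
    counts = Counter(t)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i] (stored explicitly here).
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, C, occ, n):
    # Maintain the half-open range [lo, hi) of sorted suffixes that are
    # prefixed by the pattern suffix processed so far.
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate(pattern, text):
    # Return the 0-based starting positions of pattern in text, in sorted order.
    sa, C, occ = build_index(text)
    lo, hi = backward_search(pattern, C, occ, len(sa))
    return sorted(sa[i] for i in range(lo, hi))


if __name__ == "__main__":
    print(locate("ana", "banana"))  # [1, 3]
```

In this sketch the search loop runs once per pattern character, and each reported occurrence is a direct suffix-array lookup. The explicit suffix array and occurrence table take Θ(n log n) bits, which is exactly the cost a compressed index is designed to avoid.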
