Random Access to Grammar-Compressed Strings and Trees

Philip Bille, Rajeev Raman, Gad M Landau, Oren Weimann, Kunihiko Sadakane, Srinivasa Rao Satti

doi:10.1137/130936889

Open Access

Abstract

Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures (sometimes with slight reduction in efficiency) many of the popular compression schemes, including the Lempel-Ziv family, run-length encoding, byte-pair encoding, Sequitur, and Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot\alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k$th row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompres...

Highlights

Modern textual or semi-structured databases, e.g., for biological and WWW data, are huge and are typically stored in compressed form.
We further assume that all memory cells can contain log N-bit integers; this many bits are needed just to represent the input to a random access query.
For a CFG $\mathcal{S}$ of size n representing a string of length N, we can decompress a substring of length m in time O(m + log N).
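
To make the substring-decompression highlight concrete, here is a minimal Python sketch. It is not the representation from the paper: it assumes the grammar is a straight-line program given as a list `rules`, where each entry is either a single character or a pair of indices of earlier rules, and it walks the derivation tree directly. That costs O(m + h) time, where h is the height of the grammar, rather than the paper's O(m + log N). The names `expansion_lengths` and `decompress_substring` are hypothetical.

```python
def expansion_lengths(rules):
    """Length of the string derived by each rule, computed bottom-up.

    Assumption: rules[k] is a 1-character string (terminal) or a pair
    (left, right) of indices smaller than k.
    """
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def decompress_substring(rules, lengths, root, i, m):
    """Return S[i .. i+m-1] (1-based) without expanding all of S."""
    if m < 0 or i < 1 or i + m - 1 > lengths[root]:
        raise IndexError("substring out of range")
    out = []
    skip = i - 1          # characters still to skip before output starts
    stack = [root]        # rules waiting to be expanded, leftmost on top
    while stack and len(out) < m:
        node = stack.pop()
        if skip >= lengths[node]:
            skip -= lengths[node]   # whole subtree lies before position i
            continue
        rule = rules[node]
        if isinstance(rule, str):
            out.append(rule)
        else:
            left, right = rule
            stack.append(right)
            stack.append(left)
    return "".join(out)


# Example grammar deriving the Fibonacci word "abaababa" (N = 8).
rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths = expansion_lengths(rules)
print(decompress_substring(rules, lengths, 5, 3, 4))  # prints "aaba"
```

Subtrees that lie entirely before position i are skipped in one step using the stored lengths, so only the two boundary paths and the subtrees inside the requested range are visited.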

Summary

Introduction

Modern textual or semi-structured databases, e.g., for biological and WWW data, are huge and are typically stored in compressed form. A query to such a database typically retrieves only a small portion of the data. This presents several challenges: how to query the compressed data directly and efficiently, without the need for additional data structures (which can be many times larger than the compressed data), and how to retrieve the answers to the queries. The random access problem is to compactly represent the grammar $\mathcal{S}$ while supporting fast random access queries, that is, given an index i, 1 ≤ i ≤ N, report S[i]. Given an (uncompressed) pattern string P and $\mathcal{S}$, the compressed pattern matching problem is to find all occurrences of P within S more efficiently than by naively decompressing $\mathcal{S}$ into S and searching for P in S. An important variant of the pattern matching problem allows approximate matching (i.e., P is allowed to appear in S with some errors).
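
The random access query can be illustrated with a simple baseline, sketched below in Python. This is not the paper's data structure: it assumes a straight-line program stored as a list `rules` (each entry a single character or a pair of indices of earlier rules), stores the expansion length of every rule, and descends from the start rule. A query takes time proportional to the height of the grammar, which can be far larger than the O(log N) bound the paper achieves. The function names are hypothetical.

```python
def build_lengths(rules):
    """Expansion length of every rule, computed bottom-up.

    Assumption: rules[k] is a 1-character string (terminal) or a pair
    (left, right) of indices smaller than k.
    """
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules, lengths, root, i):
    """Return S[i] for 1 <= i <= N by descending from the start rule."""
    if not 1 <= i <= lengths[root]:
        raise IndexError("index out of range")
    node = root
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i <= lengths[left]:
            node = left              # S[i] lies in the left expansion
        else:
            i -= lengths[left]       # shift the index into the right expansion
            node = right
    return rules[node]


# Example grammar deriving the Fibonacci word "abaababa" (N = 8).
rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths = build_lengths(rules)
print("".join(access(rules, lengths, 5, i) for i in range(1, 9)))  # prints "abaababa"
```

The example grammar is deliberately unbalanced: its height grows linearly with the number of rules, which is exactly the case where this baseline is slow.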

Objectives

Results

Conclusion