Abstract

Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures (sometimes with slight reduction in efficiency) many of the popular compression schemes, including the Lempel--Ziv family, run-length encoding, byte-pair encoding, Sequitur, and Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot\alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k$th row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompres...

Highlights

  • Modern textual or semi-structured databases, e.g. for biological and WWW data, are huge, and are typically stored in compressed form

  • We further make the assumption that all memory cells can contain $\log N$-bit integers – this many bits are needed just to represent the input to a random access query

  • For a CFG $\mathcal{S}$ of size $n$ representing a string of length $N$, we can decompress a substring of length $m$ in time $O(m + \log N)$


Summary

Introduction

Modern textual or semi-structured databases, e.g. for biological and WWW data, are huge, and are typically stored in compressed form. A query to such a database will typically retrieve only a small portion of the data. This presents several challenges: how to query the compressed data directly and efficiently, without the need for additional data structures (which can be many times larger than the compressed data), and how to retrieve the answers to the queries. The random access problem is to compactly represent $S$ while supporting fast random access queries, that is, given an index $i$, $1 \le i \le N$, report $S[i]$. Given an (uncompressed) pattern string $P$ and the grammar $\mathcal{S}$, the compressed pattern matching problem is to find all occurrences of $P$ within $S$ more efficiently than by naively decompressing $\mathcal{S}$ into $S$ and searching for $P$ in $S$. An important variant of the pattern matching problem allows approximate matching (i.e., $P$ is allowed to appear in $S$ with some errors).
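To make the random access problem concrete, the following minimal sketch answers a query $S[i]$ on a toy grammar without decompressing it. It is the straightforward baseline, not the representation presented in the paper: it assumes each rule is either a single character or a pair of nonterminals, stores the expansion length of every nonterminal, and walks down from the root, so a query costs time proportional to the height of the grammar rather than $O(\log N)$. The names `GRAMMAR`, `expansion_lengths`, and `access` are hypothetical and chosen only for illustration.

```python
# Toy grammar (an assumption for illustration): each rule is either a
# single terminal character or a pair (left, right) of nonterminal names.
GRAMMAR = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),  # ab
    "X4": ("X3", "X2"),  # aba
    "X5": ("X4", "X3"),  # abaab
    "X6": ("X5", "X4"),  # abaababa
}


def expansion_lengths(grammar):
    """Return the length of the string generated by each nonterminal."""
    lengths = {}

    def length(name):
        if name not in lengths:
            rule = grammar[name]
            if isinstance(rule, str):
                lengths[name] = 1
            else:
                left, right = rule
                lengths[name] = length(left) + length(right)
        return lengths[name]

    for name in grammar:
        length(name)
    return lengths


def access(grammar, lengths, root, i):
    """Return S[i] (1-indexed) by descending from the root.

    Baseline only: time is proportional to the grammar's height.
    """
    if not 1 <= i <= lengths[root]:
        raise IndexError("index out of range")
    name = root
    while True:
        rule = grammar[name]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if i <= lengths[left]:
            name = left          # position falls in the left child
        else:
            i -= lengths[left]   # skip the left child's expansion
            name = right


LENGTHS = expansion_lengths(GRAMMAR)
```

Calling `access(GRAMMAR, LENGTHS, "X6", i)` for $i = 1, \dots, 8$ spells out the string generated by `X6` one character at a time. On an unbalanced grammar the descent can take up to $n$ steps, which is the gap that an $O(\log N)$ random access bound closes.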
