Abstract

Re-Pair is a grammar compression scheme with good compression rates. Computing Re-Pair requires maintaining large frequency tables, which makes it hard to apply to large-scale data sets. As a solution to this problem, given a text of length $n$ whose characters are drawn from an integer alphabet of size $\sigma = n^{O(1)}$, we present an $O(\min(n^2, n^2 \lg \log_\tau n \lg \lg \lg n / \log_\tau n))$ time algorithm computing Re-Pair with $\max((n/c) \lg n, n \lg \tau) + O(\lg n)$ bits of working space including the text space, where $c \geq 1$ is a fixed user-defined constant and $\tau$ is the sum of $\sigma$ and the number of non-terminals. We give variants of our solution that work in parallel or in the external memory model. Unfortunately, the algorithm does not seem practical, since a preliminary version already needs roughly one hour to compute Re-Pair on one megabyte of text.

Highlights

  • Re-Pair [1] is a grammar deriving a single string

  • Besides the seminal work of Larsson and Moffat [1], there are a couple of articles devoted to the compression aspects of Re-Pair: given a text $T$ of length $n$ whose characters are drawn from an integer alphabet of size $\sigma := n^{O(1)}$, the output of Re-Pair applied to $T$ is at most $2nH_k(T) + o(n \lg \sigma)$ bits with k = o when represented naively as a list of character pairs [2], where $H_k$ denotes the empirical entropy of the $k$-th order

  • We focus on the problem of computing the grammar with an algorithm working in text space, forming a bridge between the domain of in-place string algorithms, low-memory compression algorithms, and the domain of Re-Pair computing algorithms

Summary

Introduction

Re-Pair [1] is a grammar deriving a single string. It is computed by replacing the most frequent bigram in this string with a new non-terminal, recursing until no bigram occurs more than once. Re-Pair is a so-called irreducible grammar; its grammar size, i.e., the sum of the symbols on the right-hand sides of all rules, is upper bounded by $O(n / \log_\sigma n)$ ([3], Lemma 2), which matches the information-theoretic lower bound on the size of a grammar for a string of length $n$. Charikar et al. [6] (Section G) gave an easy variation to improve the size of the grammar. Another variant, proposed by Claude and Navarro [12], runs in a user-defined working space ($> n \lg n$ bits) and shares with our proposed solution the idea of a table that (a) is stored with the text in the working space and (b) grows in rounds. Furuya et al. [18] presented a variant, called MR-Re-Pair, in which a most frequent maximal repeat is replaced instead of a most frequent bigram.
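The replacement loop described above can be illustrated with a short Python sketch. This is only a naive textbook version that recounts all bigrams in every round; it is not the small-space algorithm of the paper and makes no attempt to meet its time or space bounds. The function names (`count_bigrams`, `replace_bigram`, `repair`, `expand`), the use of integers as non-terminals, the greedy left-to-right counting of non-overlapping occurrences, and the tie-breaking among equally frequent bigrams are all assumptions made for illustration.

```python
from collections import Counter


def count_bigrams(seq):
    """Count non-overlapping bigram occurrences, scanning left to right."""
    counts = Counter()
    last_start = {}  # bigram -> start position of its last counted occurrence
    for i in range(len(seq) - 1):
        bigram = (seq[i], seq[i + 1])
        if last_start.get(bigram) == i - 1:
            continue  # overlaps the occurrence counted at i - 1 (e.g. in "aaa")
        counts[bigram] += 1
        last_start[bigram] = i
    return counts


def replace_bigram(seq, bigram, symbol):
    """Replace occurrences of `bigram` by `symbol`, greedily from the left."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == bigram:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Return (start sequence, rules) of a Re-Pair style grammar for `text`.

    Terminals are the characters of `text`; non-terminals are the integers
    0, 1, 2, ... (an assumption of this sketch).
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_bigrams(seq)
        if not counts:
            break
        # Ties are broken by first appearance; any choice would be valid.
        bigram, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # no bigram occurs more than once
        symbol = len(rules)  # fresh non-terminal
        rules[symbol] = bigram
        seq = replace_bigram(seq, bigram, symbol)
    return seq, rules


def expand(seq, rules):
    """Derive the original string from the start sequence and the rules."""
    out = []
    stack = list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)
```

Calling `repair(text)` and then `expand(seq, rules)` on the result should reproduce `text`, which is a convenient sanity check. Each round costs time linear in the current sequence length, so the sketch is quadratic in the worst case and keeps a full frequency table in memory, which is exactly the cost the paper aims to avoid.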

Related Work
Our Contribution
Preliminaries
Sequential Algorithm
Trade-Off Computation
Algorithmic Ideas
Algorithmic Details
Storing the Output In-Place
Step-by-Step Execution
Implementation
Bit-Parallel Algorithm
Broadword Search
Bit-Parallel Adaption
Computing MR-Re-Pair in Small Space
Parallel Algorithm
Computing Re-Pair in External Memory
Heuristics for Practicality
Conclusions