Abstract

The smallest grammar problem—namely, finding a smallest context-free grammar that generates exactly one sequence—is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose a new perspective on this problem by splitting it into two tasks: (1) choosing which words will be the constituents of the grammar and (2) searching for the smallest grammar given this set of constituents. We show how to solve the second task in polynomial time by parsing longer constituents with smaller ones. We propose new algorithms, based on classical practical algorithms, that use this optimization to find small grammars. Our algorithms consistently find smaller grammars on a classical benchmark, reducing the size by 10% in some cases. Moreover, our formulation allows us to define interesting bounds on the number of small grammars and to empirically compare different grammars of small size.
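
To make the second task concrete, the following is a minimal sketch of how a smallest parse can be computed once the set of constituents is fixed: the sequence (and, in the same way, each constituent) is parsed with strictly smaller constituents by a shortest-path dynamic program over string positions. This is an illustration of the idea only, not the authors' implementation; the function name minimal_parse and its interface are ours.

```python
def minimal_parse(s, constituents):
    """Smallest sequence of symbols (terminals or whole constituent words)
    whose concatenation equals s; only constituents strictly shorter than s
    are used, so longer words are parsed with smaller ones.  A sketch of the
    minimal-grammar-parsing idea, not the paper's exact implementation."""
    usable = [w for w in constituents if len(w) < len(s)]
    n = len(s)
    INF = float("inf")
    dist = [INF] * (n + 1)    # dist[i]: fewest symbols covering s[:i]
    back = [None] * (n + 1)   # back-pointer: (previous position, symbol used)
    dist[0] = 0
    for i in range(n):
        if dist[i] == INF:
            continue
        # option 1: emit the single terminal s[i]
        if dist[i] + 1 < dist[i + 1]:
            dist[i + 1] = dist[i] + 1
            back[i + 1] = (i, s[i])
        # option 2: emit one non-terminal for a constituent occurrence at i
        for w in usable:
            j = i + len(w)
            if j <= n and s[i:j] == w and dist[i] + 1 < dist[j]:
                dist[j] = dist[i] + 1
                back[j] = (i, w)
    # recover the optimal parse from the back-pointers
    parse, i = [], n
    while i > 0:
        i, symbol = back[i]
        parse.append(symbol)
    return list(reversed(parse))

# Example: minimal_parse("abcabdabcabd", ["abcabd", "abc", "ab"])
# returns ["abcabd", "abcabd"] (two non-terminal symbols instead of 12 terminals).
```

Applied to the sequence and to every chosen constituent, such a parse yields the right-hand sides of a grammar for that choice of constituents; this is, in spirit, the operation written mgp({s} ∪ Q) later in this summary.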

Highlights

  • The smallest grammar problem—namely, finding a smallest context-free grammar that generates exactly one sequence—is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. The size of a smallest grammar can be considered a computable variant of Kolmogorov complexity, in which the Turing machine description of the sequence is restricted to context-free grammars. The problem is decidable, but still hard: finding a smallest grammar is NP-hard [1]

  • Note that no Iterative Repeat Replacement (IRR) algorithm could generate G∗ and, by enumeration, we find that the smallest possible grammar that can be obtained with an IRR algorithm has size 46 + |Gmin(α)| + |Gmin(β)| + …

  • We analyzed a new approach to the Smallest Grammar Problem, which consisted in optimizing separately the choice of words that are going to be constituents, and the choice of which occurrences of these constituents will be rewritten by non-terminals

Summary

Introduction

The smallest grammar problem—namely, finding a smallest context-free grammar that generates exactly one sequence—is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. In order to derive a score function corresponding to COMPRESSIVE, note that replacing a word ω by a non-terminal contracts the grammar by (|ω| − 1) ∗ oP(ω) symbols, while including the new rule in the grammar adds |ω| + 1 to the grammar size. This defines f(ω, P) = fMC(ω, P) = (|ω| − 1) ∗ (oP(ω) − 1) − 2. Once an IRR algorithm has chosen a repeated word ω, it replaces all non-overlapping occurrences of that word in the current grammar by a new non-terminal N and adds N → ω to the set of production rules. If Q is a subset of the repeats of the sequence s, we denote by mgp({s} ∪ Q) the set of production rules P corresponding to one of the minimal grammar parsings of {s} ∪ Q. In the paper's example, this means gaining 9 symbols and losing only 6 (because of the introduction of the new right-hand sides).
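
As a hedged illustration of the score function and of the IRR replacement step described above, the toy Python sketch below repeatedly replaces the best-scoring repeat by a fresh non-terminal. It assumes terminals are lower-case characters and represents every non-terminal as a single upper-case letter, so that symbol counts equal string lengths; the names occurrences, repeats, f_mc and irr_mc are ours, and the code reconstructs the general IRR scheme driven by the fMC score rather than the authors' implementation.

```python
import re

def occurrences(rhs_list, word):
    """Non-overlapping occurrence count oP(word) over all right-hand sides."""
    return sum(len(re.findall(re.escape(word), rhs)) for rhs in rhs_list)

def repeats(rhs_list):
    """All words of length >= 2 that occur at least twice (non-overlapping)
    in the current right-hand sides; quadratic enumeration, fine for a toy."""
    cands = {rhs[i:j] for rhs in rhs_list
             for i in range(len(rhs)) for j in range(i + 2, len(rhs) + 1)}
    return {w for w in cands if occurrences(rhs_list, w) >= 2}

def f_mc(word, rhs_list):
    """Compressive score: contraction (|w| - 1) * oP(w) minus the |w| + 1
    symbols of the new rule, i.e. (|w| - 1) * (oP(w) - 1) - 2."""
    return (len(word) - 1) * (occurrences(rhs_list, word) - 1) - 2

def irr_mc(sequence):
    """Toy Iterative Repeat Replacement driven by f_MC: repeatedly rewrite
    the best-scoring repeat with a fresh non-terminal while the score is > 0."""
    rules = {"S": sequence}
    # single-letter non-terminals; 'S' is reserved for the start symbol (toy assumption)
    fresh = iter("NOPQRTUVWXYZABCDEFGHIJKLM")
    while True:
        rhs_list = list(rules.values())
        scored = [(f_mc(w, rhs_list), w) for w in repeats(rhs_list)]
        if not scored:
            break
        gain, best = max(scored)
        if gain <= 0:
            break  # no replacement shrinks the grammar any further
        nt = next(fresh)
        # replace all non-overlapping occurrences, then add the new rule nt -> best
        rules = {lhs: rhs.replace(best, nt) for lhs, rhs in rules.items()}
        rules[nt] = best
    return rules
```

For instance, irr_mc("how much wood would a woodchuck chuck") introduces one rule for the repeat "chuck" and one for " wood", after which no remaining repeat has a positive fMC score.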

Experiments
Findings
Conclusions and Future Work
