Abstract

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in case of bytes) are related as z_b = O(bz log(n/z)). The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound z_b = O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods.

Highlights

  • The Lempel-Ziv parsing (LZ77) [1,2] is one of the central techniques in data compression and string algorithms

  • Our main result is that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in case of bytes) are related as z_b = O(bz log(n/z))

  • If a string s is produced by a straight-line program (SLP) grammar of size g, there exists a phrase-aligned LZ77 parsing f_1 f_2 ... f_z for s of size at most g

Summary

Lempel-Ziv Parsing for Sequences

The Lempel-Ziv parsing (LZ77) [1,2] is one of the central techniques in data compression and string algorithms. Our main result is that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in case of bytes) are related as z_b = O(bz log(n/z)) (a more precise formulation follows). We partially complement this upper bound with a lower bound z_b = Ω(bz) in a series of examples. A better lower bound z_b = Ω(bz log n), which would show that our main result is tight, even only for b = 2, would imply that the minimal grammar generating the string attaining this bound is of size Ω(z log n), removing the O(log log n)-factor gap. This gives a new approach to attack this problem.
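
To make the quantities z and z_b concrete, here is a minimal Python sketch that computes a greedy LZ77 parsing of a string and of its block contraction. It is a naive illustration, not the construction used in the paper. The function names lz77_phrases and block_contraction are hypothetical, and the sketch assumes that the string length is a multiple of b and that a phrase is either a new letter or the longest prefix of the remaining text with an earlier occurrence.

```python
def lz77_phrases(s, overlapping=True):
    """Greedy LZ77 parsing of a sequence s (a str or a tuple of letters).

    Each phrase is either a single letter with no earlier occurrence or
    the longest prefix of the unparsed suffix that also starts at an
    earlier position. With overlapping=False the earlier occurrence must
    end before the current phrase begins.
    Naive implementation with cubic worst-case time; for illustration only.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):  # candidate start of an earlier occurrence
            l = 0
            while (i + l < n and s[j + l] == s[i + l]
                   and (overlapping or j + l < i)):
                l += 1
            best = max(best, l)
        length = max(best, 1)  # a letter seen for the first time is a phrase
        phrases.append(s[i:i + length])
        i += length
    return phrases


def block_contraction(s, b):
    """View s as a sequence of letters, each letter being a block of length b."""
    if len(s) % b != 0:
        raise ValueError("this sketch assumes len(s) is a multiple of b")
    return tuple(s[k:k + b] for k in range(0, len(s), b))


if __name__ == "__main__":
    s = "abababab"
    z = len(lz77_phrases(s))                          # parsing over letters
    z_b = len(lz77_phrases(block_contraction(s, 2)))  # parsing over blocks
    print(z, z_b)
```

Running both parsings on the same input, once over single letters and once over blocks, gives the two phrase counts that the bound z_b = O(bz log(n/z)) relates.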

LZ77 Parsings
Block Contractions and a Lower Bound for Their LZ77 Parsings
Upper Bounds on LZ77 Parsings for Block Contractions
Basic Ideas
Greedy Phrase-Splitting Procedure
Formalized Recursive Phrase-Splitting Procedure
Basic Analysis of the Number of Produced Phrases
Conclusions