Lempel-Ziv Parsing for Sequences of Blocks

Daniel Valenzuela, Dmitry Kosolobov

doi:10.3390/a14120359

Abstract

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in the case of bytes) are related as z_b = O(b z log(n/z)). The bound holds for both the “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound z_b = O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods.

Highlights

The Lempel-Ziv parsing (LZ77) [1,2] is one of the central techniques in data compression and string algorithms.
Our main result is that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in the case of bytes) are related as z_b = O(b z log(n/z)).
If a string s is produced by a straight-line program (SLP) grammar of size g, there exists a phrase-aligned LZ77 parsing f_1 f_2 ... f_z for s of size at most g.

Summary

Lempel-Ziv Parsing for Sequences

The Lempel-Ziv parsing (LZ77) [1,2] is one of the central techniques in data compression and string algorithms. Our main result is that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in the case of bytes) are related as z_b = O(b z log(n/z)) (a more precise formulation follows). We partially complement this upper bound with a lower bound z_b = Ω(bz) in a series of examples. A better lower bound z_b = Ω(bz log n), which would show that our main result is tight, would imply, even for b = 2 alone, that the minimal grammar generating the string attaining this bound is of size Ω(z log n), removing the O(log log n)-factor gap. This gives a new approach to attack this problem.
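
To make the two phrase counts compared above concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes one common greedy definition of LZ77 (each phrase is either a letter with no earlier occurrence or the longest prefix of the remaining input that occurs earlier), handles both the overlapping and non-overlapping variants, and runs in quadratic time for clarity rather than speed. The function names `lz77_phrases` and `contract` are hypothetical, and treating a trailing short block as its own letter is an assumption made only so the sketch accepts any input length.

```python
def lz77_phrases(seq, overlapping=True):
    """Greedy LZ77 parsing of a string or list of letters (naive O(n^2) sketch).

    Each phrase is a single letter with no earlier match, or the longest
    prefix of the unparsed suffix that also starts at an earlier position.
    With overlapping=False the earlier occurrence must end before the phrase
    starts.
    """
    n = len(seq)
    phrases = []
    i = 0
    while i < n:
        best = 0
        for j in range(i):  # j = candidate start of an earlier occurrence
            # Maximum match length allowed for this candidate.
            limit = n - i if overlapping else min(n - i, i - j)
            k = 0
            while k < limit and seq[j + k] == seq[i + k]:
                k += 1
            best = max(best, k)
        length = max(best, 1)  # no match: emit the next letter on its own
        phrases.append(seq[i:i + length])
        i += length
    return phrases


def contract(s, b):
    """Reinterpret s as a sequence of blocks of length b (each block is a letter).

    Assumption: if len(s) is not a multiple of b, the shorter final block is
    kept as one more letter.
    """
    return [s[i:i + b] for i in range(0, len(s), b)]


if __name__ == "__main__":
    s = "abababab"
    z = len(lz77_phrases(s))                  # parsing over single letters
    z_b = len(lz77_phrases(contract(s, 2)))   # parsing over blocks of length 2
    print(z, z_b)
```

For the toy input `abababab` with b = 2, the overlapping variant yields the letter-level parsing a, b, ababab (z = 3) and the block-level parsing ab, ab·ab·ab (z_b = 2). Such small inputs only illustrate the definitions; they say nothing about the asymptotic bounds.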

LZ77 Parsings

Block Contractions and a Lower Bound for Their LZ77 Parsings

Upper Bounds on LZ77 Parsings for Block Contractions

Basic Ideas

Greedy Phrase-Splitting Procedure

Formalized Recursive Phrase-Splitting Procedure

Basic Analysis of the Number of Produced Phrases

Conclusions

Journal: Algorithms
Publication Date: Dec 10, 2021
License type: CC BY 4.0
