We propose a joint source-channel coding algorithm capable of correcting some errors in the popular Lempel-Ziv'77 (LZ'77) scheme without introducing any measurable degradation in the compression performance. This can be achieved because the LZ'77 encoder does not completely eliminate the redundancy present in the input sequence. One source of redundancy can be observed when an LZ'77 phrase has multiple matches. In this case, LZ'77 can issue a pointer to any of those matches, and a particular choice carries some additional bits of information. We call a scheme with embedded redundant information the LZS'77 algorithm. We analyze the number of longest matches in such a scheme and prove that it follows the logarithmic series distribution with mean 1/h (plus some fluctuations), where h is the source entropy. Thus, the distribution associated with the number of redundant bits is well concentrated around its mean, a highly desirable property for error correction. These analytic results are proved by a combination of combinatorial, probabilistic, and analytic methods (e.g., Mellin transform, depoissonization, combinatorics on words). In fact, we analyze LZS'77 by studying the multiplicity matching parameter in a suffix tree, which in turn is analyzed via comparison to its independent version, called trie. Finally, we present an algorithm in which a channel coder (e.g., Reed-Solomon (RS) coder) succinctly uses the inherent additional redundancy left by the LZS'77 encoder to detect and correct a limited number of errors. We call such a scheme the LZRS'77 algorithm. LZRS'77 is perfectly backward-compatible with LZ'77, that is, a file compressed with our error-resistant LZRS'77 can still be decompressed by a generic LZ'77 decoder
Read full abstract