Abstract

Analytic information theory aims at studying problems of information theory using analytic techniques of computer science and combinatorics. Following Hadamard's precept, these problems are tackled by complex-analysis methods such as generating functions, the Mellin transform, Fourier series, the saddle point method, analytic poissonization and depoissonization, and singularity analysis. This approach lies at the crossroads of computer science and information theory. In this survey we concentrate on one facet of information theory, namely source coding, better known as data compression, and specifically the $\textit{redundancy rate}$ problem, which asks by how much the actual code length exceeds the optimal code length. We further restrict our interest to the $\textit{average}$ redundancy for $\textit{known}$ sources, that is, when the statistics of the information source are known. We present precise analyses of three types of lossless data compression schemes, namely fixed-to-variable (FV) length codes, variable-to-fixed (VF) length codes, and variable-to-variable (VV) length codes. In particular, we investigate the average redundancy of Huffman, Tunstall, and Khodak codes. These codes have succinct representations as $\textit{trees}$, either coding or parsing trees, and we analyze some of their parameters (e.g., the average path length from the root to a leaf).
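
In symbols, and in a standard formalization consistent with the survey's setting (the notation below is ours, not quoted from the paper): writing $L(C_n, x_1^n)$ for the length assigned to a block $x_1^n$ by a code $C_n$, and noting that the ideal code length for a known source $P$ is the self-information $-\log_2 P(x_1^n)$, the average redundancy is

$$\bar{R}_n \;=\; \mathbf{E}\bigl[L(C_n, X_1^n)\bigr] \;-\; \mathbf{E}\bigl[-\log_2 P(X_1^n)\bigr] \;=\; \mathbf{E}\bigl[L(C_n, X_1^n)\bigr] \;+\; \mathbf{E}\bigl[\log_2 P(X_1^n)\bigr].$$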

Highlights

  • The basic problem of source coding, better known as data compression, is to find a binary code that can be unambiguously recovered and that has the shortest possible description, either on average or for individual sequences

  • We show that the average redundancy either converges to an explicitly computable constant as the block length increases, or exhibits very erratic behavior, fluctuating between 0 and 1 (see the sketch after this list)

  • Savari and Gallager [112] present an analysis of the dominant term in the asymptotic expansion of the Tunstall code redundancy. In this survey, following [33], we describe a precise analysis of the phrase length for such a code and its average redundancy
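
To make the second highlight concrete, here is a minimal brute-force sketch, not taken from the survey (the function names `huffman_expected_length` and `block_redundancy` are ours). It builds a Huffman code over all $2^n$ blocks of a memoryless binary source and reports the average redundancy per block, so one can watch its behavior as the block length $n$ grows:

```python
import heapq
from math import comb, log2

def huffman_expected_length(probs):
    """Expected codeword length of an optimal binary Huffman code.

    Uses the fact that E[L] equals the sum of the probabilities of all
    internal nodes created during the merging process."""
    heap = [(p, i) for i, p in enumerate(probs)]  # the index breaks ties
    heapq.heapify(heap)
    next_id = len(probs)
    expected_len = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        expected_len += p1 + p2  # this merge adds one bit to every leaf below
        heapq.heappush(heap, (p1 + p2, next_id))
        next_id += 1
    return expected_len

def block_redundancy(p, n):
    """Average redundancy E[L] - n*H(p) of a Huffman code built on
    length-n blocks of a Bernoulli(p) memoryless source."""
    h = -p * log2(p) - (1 - p) * log2(1 - p)
    # Blocks with the same number of ones are equiprobable.
    probs = []
    for k in range(n + 1):
        probs.extend([p**k * (1 - p)**(n - k)] * comb(n, k))
    return huffman_expected_length(probs) - n * h

for n in range(1, 13):
    print(n, round(block_redundancy(0.3, n), 5))
```

For $p = 0.3$ the quantity $\log_2\frac{1-p}{p}$ is irrational and the printed values settle toward a constant; trying instead a value such as $p = 0.2$ (for which $\log_2\frac{1-p}{p} = 2$ is rational) exhibits the oscillatory side of the dichotomy stated in the highlight.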


Introduction

The basic problem of source coding, better known as (lossless) data compression, is to find a binary code that can be unambiguously recovered and that has the shortest possible description, either on average or for individual sequences. Thanks to Shannon’s work we know that on average the number of binary bits per source symbol cannot be smaller than the source entropy rate. Since there are many codes achieving the entropy rate, one turns attention to redundancy. The average redundancy of a source code is the amount by which the expected number of binary digits per source symbol for that code exceeds the entropy. One of the goals in designing source coding algorithms is to minimize the average redundancy. In this survey, we discuss various classes of source codes and their corresponding average redundancy. It turns out that such analyses often lead to studying certain intriguing trees, such as Huffman, Tunstall, and Khodak trees.
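
As a companion illustration on the variable-to-fixed side, the following sketch (again ours, not the survey's; `tunstall_redundancy` is a hypothetical name) grows a Tunstall parsing tree for a binary memoryless source by repeatedly expanding the most probable leaf, then reports the average phrase length $\mathbf{E}[D]$ together with the idealized per-symbol redundancy $\log_2 M / \mathbf{E}[D] - h$, where $M$ is the number of leaves and $h$ the entropy rate:

```python
import heapq
from math import log2

def tunstall_redundancy(p, M):
    """Grow a Tunstall parsing tree with M leaves for a Bernoulli(p) source.

    Returns (average phrase length E[D], per-symbol redundancy), where
    E[D] is the sum of the probabilities of the internal nodes and the
    code rate is idealized as log2(M) bits per phrase."""
    heap = [(-1.0, 0)]  # max-heap via negation; root = empty phrase, prob 1
    next_id = 1
    leaves, avg_phrase_len = 1, 0.0
    while leaves < M:
        neg_q, _ = heapq.heappop(heap)  # most probable leaf
        q = -neg_q
        avg_phrase_len += q  # the expanded leaf becomes an internal node
        for symbol_prob in (p, 1.0 - p):
            heapq.heappush(heap, (-q * symbol_prob, next_id))
            next_id += 1
        leaves += 1  # one leaf is replaced by its two children
    h = -p * log2(p) - (1 - p) * log2(1 - p)
    return avg_phrase_len, log2(M) / avg_phrase_len - h

for M in (4, 16, 64, 256, 1024):
    d, r = tunstall_redundancy(0.2, M)
    print(M, round(d, 4), round(r, 5))
```

The printed redundancy shrinks as the average phrase length grows; pinning down the dominant term of this decay precisely is the analysis referred to in the highlights ([112] and [33]).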
