On the MDL principle for i.i.d. sources with large alphabets

G. I. Shamir

doi:10.1109/tit.2006.872846

Abstract

Average case universal compression of independent and identically distributed (i.i.d.) sources is investigated, where the source alphabet is large, and may be sublinear in size or even larger than the compressed data sequence length n. In particular, the well-known results for fixed-size alphabets, including Rissanen's strongest sense lower bound, are extended to the case where the alphabet size k is allowed to grow with n. It is shown that as long as k = o(n), instead of the coding cost in the fixed-size alphabet case of 0.5 log n extra code bits for each one of the k-1 unknown probability parameters, the cost is now 0.5 log(n/k) code bits for each unknown parameter. This result is shown to be the lower bound in the minimax and maximin senses, as well as for almost every source in the class. Achievability of this bound is demonstrated with two-part codes based on quantization of the maximum-likelihood (ML) probability parameters, as well as by using the well-known Krichevsky-Trofimov (KT) low-complexity sequential probability estimates. For very large alphabets, k ≫ n, it is shown that an average minimax and maximin bound on the redundancy is essentially (to first order) log(k/n) bits per symbol. This bound is shown to be achievable both with two-part codes and with a sequential modification of the KT estimates. For k = Θ(n), the redundancy is Θ(1) bits per symbol. Finally, sequential codes are designed for coding sequences in which only m < min{k, n} alphabet symbols occur.
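The sequential KT estimate mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the standard add-1/2 form of the estimator, in which a symbol seen c times among the first t symbols is assigned probability (c + 1/2)/(t + k/2); the abstract itself does not spell this formula out. The function names are hypothetical and are not taken from the paper. The sketch compares the KT code length of one sequence with its empirical entropy and prints the first-order term 0.5 (k-1) log2(n/k) alongside. That comparison is only indicative: the paper bounds average redundancy over the source, and lower-order terms are omitted here.

```python
import math
import random
from collections import Counter


def kt_code_length_bits(seq, k):
    """Ideal code length (bits) of seq under the sequential add-1/2 (KT) estimate.

    Assumes symbols come from an alphabet of known size k.
    """
    counts = Counter()
    total_bits = 0.0
    for t, symbol in enumerate(seq):
        # KT predictive probability after t symbols have been seen.
        p = (counts[symbol] + 0.5) / (t + k / 2.0)
        total_bits -= math.log2(p)
        counts[symbol] += 1
    return total_bits


def empirical_entropy_bits(seq):
    """n times the empirical entropy of seq, i.e. the ML code length in bits."""
    n = len(seq)
    counts = Counter(seq)
    return -sum(c * math.log2(c / n) for c in counts.values())


def first_order_cost_bits(n, k):
    """First-order term 0.5 * (k - 1) * log2(n / k), stated for k = o(n)."""
    return 0.5 * (k - 1) * math.log2(n / k)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n, k = 10_000, 50       # illustrative values with k much smaller than n
    seq = [rng.randrange(k) for _ in range(n)]
    excess = kt_code_length_bits(seq, k) - empirical_entropy_bits(seq)
    print(f"KT excess over empirical entropy: {excess:.1f} bits")
    print(f"0.5*(k-1)*log2(n/k):              {first_order_cost_bits(n, k):.1f} bits")
```

The excess is never negative, because the KT probability of a sequence is a mixture over i.i.d. distributions and so cannot exceed the maximum-likelihood probability of that sequence.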

Journal: IEEE Transactions on Information Theory
Publication Date: May 1, 2006
Citations: 41

Similar Papers

Average case universal lossless compression with unknown alphabets
G. I. Shamir
27 Jun 2004

Universal Lossless Compression With Unknown Alphabets—The Average Case
G. I. Shamir
IEEE Transactions on Information Theory | VOL. 52
01 Nov 2006

Bounds on the entropy of patterns of I.I.D. sequences
G. I. Shamir
01 Jan 2004

Sequential universal lossless techniques for compression of patterns and their description length
G. I. Shamir
23 Mar 2004
