Abstract

Entropy gives a lower bound on the number of bits required to represent the information in the texts of a language. It is a function of the probability distribution of the language units. A set of language units together with their probabilities constitutes a model of the texts; a different set of units and probabilities yields a different model. This paper reports a study of the entropies of Chinese texts under three models based on the Chinese phonetic system, Hanyu Pinyin. These models yield higher entropy values than the ideogram-based model. However, transcribing Chinese text in Hanyu Pinyin offers a simple method of Chinese input, and no conversion is needed before storage in computer systems. In addition, the frequency table that must be encoded in static and semi-adaptive text compression schemes is much smaller for Pinyin units than for ideograms. This is an important advantage when compressing small to medium-sized text files.
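
For concreteness, the entropy in question is the Shannon entropy H = -Σ p_i log2(p_i) taken over the probabilities of the chosen language units. The following minimal sketch (not from the paper) shows how this quantity is computed from a frequency table; the Pinyin syllable counts are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits per unit, given a frequency table of unit counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical frequency table over Pinyin syllables (illustrative values only).
pinyin_counts = Counter({"de": 120, "shi": 85, "zhong": 40, "guo": 38, "ren": 35})

print(f"{entropy(pinyin_counts):.3f} bits per syllable")
```

Because the Pinyin syllable inventory is far smaller than the Chinese ideogram inventory, such a frequency table has far fewer entries, which is the source of the coding-size advantage the abstract notes for static and semi-adaptive compression schemes.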
