Abstract

In this paper, a large-alphabet-oriented scheme is proposed for both Chinese and English text compression. Our scheme parses Chinese text with the alphabet defined by the Big-5 code, and parses English text with a set of parsing rules designed in this work. Thus, the alphabet used for English is not a word alphabet. After a token is parsed out from the input text, zero-, first-, and second-order Markov models are used to estimate its occurrence probabilities. The estimated probabilities are then blended and accumulated to perform arithmetic coding. To implement arithmetic coding under a large alphabet with probability blending, a method for partitioning the count-value range is studied. Our scheme has been implemented as a software package, and typical Chinese and English text files are compressed with it to study the influence of alphabet size and prediction order. On average, our compression scheme reduces a text file to 33.9% of its original size for Chinese text and to 23.3% for English text. These rates are comparable with, or better than, those obtained by popular data compression packages. Copyright © 2005 John Wiley & Sons, Ltd.
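As a concrete illustration of the probability-blending step described above, the sketch below maintains zero-, first-, and second-order token counts and blends the three per-order estimates with fixed weights. This is a minimal sketch, not the authors' implementation: the class name BlendedMarkovModel, the blend weights, and the add-one smoothing are illustrative assumptions, and the paper's own blending and count-range partitioning rules may differ.

```python
from collections import defaultdict

class BlendedMarkovModel:
    """Sketch of blending zero-, first-, and second-order Markov estimates.
    The fixed weights and add-one smoothing are assumptions for illustration,
    not the scheme proposed in the paper."""

    def __init__(self, alphabet_size, weights=(0.2, 0.3, 0.5)):
        self.alphabet_size = alphabet_size  # size of the (large) token alphabet
        self.weights = weights              # blend weights for orders 0, 1, 2
        # One count table per order, keyed by context tuple, then by token.
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]

    def _order_prob(self, order, context, token):
        # Use only the last `order` tokens of the context for this model.
        ctx = context[len(context) - order:] if order else ()
        table = self.counts[order][ctx]
        total = sum(table.values())
        # Add-one smoothing keeps unseen tokens at a nonzero probability,
        # which an arithmetic coder requires.
        return (table[token] + 1) / (total + self.alphabet_size)

    def prob(self, context, token):
        """Blended probability estimate that would drive the arithmetic coder."""
        context = tuple(context)
        return sum(w * self._order_prob(order, context, token)
                   for order, w in enumerate(self.weights))

    def update(self, context, token):
        """Record the token under all three context orders after coding it."""
        context = tuple(context)
        for order in range(3):
            ctx = context[len(context) - order:] if order else ()
            self.counts[order][ctx][token] += 1

# Hypothetical usage: code each parsed token, then update the models.
model = BlendedMarkovModel(alphabet_size=5000)
history = []
for token in ["the", "quick", "brown"]:
    p = model.prob(history[-2:], token)  # probability passed to the coder
    model.update(history[-2:], token)
    history.append(token)
```

Because the per-order weights sum to one and each smoothed distribution sums to one over the alphabet, the blended estimate is itself a valid probability distribution, so its cumulative values can be used directly to partition the arithmetic coder's count-value range.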
