Abstract

«Texts written in a natural language are essentially made of words of this language.» We use this obvious fact, together with an extensive lexicon, to define a good model of the statistical behavior of letters in texts. This model is combined with the arithmetic coding scheme to build an efficient universal data compression method. Initially, our method was specialized for the compression of French texts; however, it can easily be adapted to other languages. Tests show that the compression ratio obtained by our method is on average 30% on French texts, whereas Ziv & Lempel's method yields an average ratio of 40% on the same texts. On other kinds of test files (English text, executable files, sources), the use of an order-1 Markov chain leads to results of the same order as Ziv & Lempel's. We also present a new approach to the dynamic construction of dictionaries for natural language compression: the fact, well known to linguists, that the number of different words is small makes such a dynamic construction possible.

Keywords: English Text, Arithmetic Code, Punctuation Mark, Natural Language Text, Current Interval
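
The abstract describes driving an arithmetic coder with a statistical model of letters (an order-1 Markov chain in the general-purpose variant). As an illustration only, the sketch below encodes a string with an exact-fraction arithmetic coder driven by an order-1 letter model estimated from the input itself; the function names and the use of Python's Fraction are assumptions for this sketch and do not reflect the authors' lexicon-based implementation.

```python
from fractions import Fraction
from collections import defaultdict

def train_order1(text):
    """Count letter-after-letter transitions to build an order-1 Markov model."""
    counts = defaultdict(lambda: defaultdict(int))
    prev = None  # None acts as the start-of-text context
    for ch in text:
        counts[prev][ch] += 1
        prev = ch
    # Convert the counts into cumulative probability intervals per context.
    model = {}
    for ctx, dist in counts.items():
        total = sum(dist.values())
        cum = Fraction(0)
        intervals = {}
        for ch, c in sorted(dist.items()):
            p = Fraction(c, total)
            intervals[ch] = (cum, cum + p)
            cum += p
        model[ctx] = intervals
    return model

def arithmetic_encode(text, model):
    """Narrow [low, high) by each symbol's conditional interval; return a point inside."""
    low, high = Fraction(0), Fraction(1)
    prev = None
    for ch in text:
        lo, hi = model[prev][ch]
        width = high - low
        low, high = low + width * lo, low + width * hi
        prev = ch
    return (low + high) / 2  # any number in [low, high) identifies the text

if __name__ == "__main__":
    sample = "les textes sont faits de mots"
    model = train_order1(sample)
    print(arithmetic_encode(sample, model))
```

A decoder would reverse the interval narrowing using the same model and the message length; the paper's method additionally builds its letter statistics from an extensive lexicon rather than from the message being encoded.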
