Natural Language Compression per Blocks

Petr Proch´Zka,Jan Holub

doi:10.1109/ccp.2011.25

Abstract

We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. It means that the input is divided into the several blocks and each of the blocks is compressed separately according to its own statistical model. To avoid the redundancy the final vocabulary file is composed as the sequence of the changes in the model of the two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties including very high compression and decompression speed and acceptable compression ratio around 32 % on natural language text. Moreover STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, whereas the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture. It can send namely single compressed blocks only with corresponding part of the vocabulary. Further STBDC enables various approaches of updating and extending of the compressed text.

Full Text