ODC: Frame for definition of Dense codes

Petr Procházka, Jan Holub

doi:10.1016/j.ejc.2012.07.014

Abstract

Natural language compression has made great progress in the last two decades. The main step in this evolution was the introduction of word-based compression by Moffat. Another improvement came with the so-called Dense codes, which proved to be very fast in compression and decompression while keeping a good compression ratio and direct search capability. Many variants of the Dense codes have been described, each with its own definition. In this paper, we present a generalized concept of dense coding called Open Dense Code (ODC), which aims to be a frame for the definition of many other dense code schemes. ODC underlines the common features of the dense code schemes while still allowing the differences among them to be expressed. Using the frame of ODC, we present two new word-based statistical compression algorithms based on the dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC). Our algorithms improve the compression ratio and are well suited to smaller files, which other compressors often neglect.

Full Text