Abstract

In the past, lossless compression researchers have developed highly sophisticated text compression algorithms. We propose an alternative approach: a reversible transformation applied to the source text that improves existing algorithms' ability to compress it. The basic idea is to encode every word in the input text file that is also found in our English dictionary as a word in a transformed static dictionary. The transformed words are shorter than most input words and also retain some context information, so we achieve some compression at the preprocessing stage while creating additional context for the compression algorithms to exploit. We collect about 60,000 English words into a dictionary and sort them by length; words of the same length are further sorted by frequency. The space character delimits the text, and every encoded word starts with '*'. The second character of the codeword indicates the length of the word, represented by 'a', 'b', 'c', ... for lengths 1, 2, 3, and so on. The remaining characters indicate the offset of the English word within the block of words of the same length, encoded as 'a', 'b', ..., 'z', 'A', ..., 'Z', 'aa', ... In the current dictionary, every word can be encoded with no more than 5 characters, including '*'. For instance, the 1st word of length 10 in the English dictionary, 'autostrada', is encoded as '*j'; the 2nd, 'aspidistra', as '*ja'; the 79th as '*jaA'; the 105th as '*jba'; the 2809th as '*jaba'; and so on. Words not in the English dictionary are passed to the transformed text unaltered. The transformation also handles special characters, punctuation marks, and capitalization. By the very nature of the encoding scheme, LIPT (Length Index Preserved Transformation) provides a fixed context for the space character and '*', as well as other useful context.
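The length-plus-offset codeword construction can be sketched as follows. This is a minimal reading reverse-engineered from the five worked examples above ('*j', '*ja', '*jaA', '*jba', '*jaba'), not the authors' actual implementation; in particular, the assumption that the one-character offset block holds 51 ranks (2 through 52) is what makes all five examples come out consistently.

```python
# 52 offset symbols in the order given in the abstract: 'a'..'z' then 'A'..'Z'.
CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def offset_suffix(rank):
    """Encode a word's 1-based rank within its same-length block as the
    suffix that follows '*<length-char>'.

    Reconstructed from the abstract's examples:
      rank 1 -> '' , rank 2 -> 'a', rank 79 -> 'aA',
      rank 105 -> 'ba', rank 2809 -> 'aba'.
    """
    if rank == 1:
        return ""              # the most frequent word needs no suffix
    # Walk through suffix-length blocks: 51 one-char suffixes (an
    # assumption forced by the examples), then 52**2 two-char, 52**3
    # three-char, and so on.
    start, length, size = 2, 1, 51
    while rank >= start + size:
        start += size
        length += 1
        size = 52 ** length
    n = rank - start           # 0-based position inside this block
    out = []
    for _ in range(length):    # write n in base 52, fixed width
        out.append(CHARS[n % 52])
        n //= 52
    return "".join(reversed(out))


def lipt_codeword(word_len, rank):
    """'*' marker, then a length character ('a' = length 1, 'b' = 2, ...),
    then the offset suffix for the word's rank within its length block."""
    return "*" + CHARS[word_len - 1] + offset_suffix(rank)


# The abstract's example: 'autostrada' is the 1st word of length 10.
print(lipt_codeword(10, 1))     # -> *j
print(lipt_codeword(10, 2809))  # -> *jaba
```

Under this reading, a decoder only needs the same sorted dictionary: it strips '*', reads the length character to pick the block, and converts the suffix back to a rank, which makes the transform reversible as the abstract requires.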
LIPT uniformly beats the available text compression methods in the literature with respect to compression ratio. For Bzip2 and PPM, the most efficient compression algorithms so far, the average BPC using Bzip2 is 2.28, while Bzip2 with LIPT gives an average of 2.16, a 5.24% improvement on a corpus combining the text files of the Canterbury, Calgary, and Gutenberg corpora. PPMD (order 5) gives an average BPC of 2.14, while PPMD with LIPT gives 2.04, an improvement of 4.46%, with almost the same compression time. Gzip -9 with LIPT shows a 6.78% improvement in average BPC over Gzip -9 alone. Bzip2 with LIPT, although 79.12% slower than the original Bzip2 in compression time, achieves an average BPC almost equal to that of the original PPMD. The overhead of downloading the 0.5 MB dictionary, which is done only once along with transmitting the first file to the receiver, is absorbed after 9.5 MB of data are transmitted in the Bzip2-with-LIPT scheme. With increasing dictionary size this threshold will rise, but the amortized cost becomes negligible when thousands of files are transmitted. Bzip2 with LIPT improves transmission time by 5.91% over Bzip2, and Gzip with LIPT improves it by 1.97% over Gzip. Average compression time using LIPT is 223% slower than Gzip; decompression times are 93.3% slower, 566% slower, and 5.9% faster than the original Bzip2, Gzip, and PPMD, respectively. This research is supported by NSF Award No. IIS-9977336.
