Parsing strategies for BWT compression

R.Y.K Isal,A Moffat

doi:10.1109/dcc.2001.917174

Abstract

Block-sorting is an innovative compression mechanism introduced by Burrows and Wheeler (1994), and has been the subject of considerable scrutiny in the years since it first became public. Block-sorting compression is usually described as involving three steps: permuting the input one block at a time through the use of the Burrows-Wheeler transform (BWT); applying a move-to-front (MTF) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. In this paper we prepend a fourth transformation to this sequence: parsing. In the BWT implementations that have been considered to date the unit of transmission has been taken to be the ASCII character. But there is no particular reason why this should be so, and a range of other strategies can be used to construct the sequence of symbols that is fed into the BWT process. We consider some of the issues associated with making this change, and show that in some situations the introduction of a simple parsing stage allows improved compression to be obtained compared to an otherwise equivalent character-based BWT implementation. We also describe an MTF-like ranking transformation that caters better to large-alphabet situations than does the strict MTF rule used in conventional BWT implementations.

Full Text