Practical constructions of L-restricted alphabetic prefix codes

E.S. Laber, R.L. Milidiu, A.A. Pessoa

doi:10.1109/spire.1999.796585

Abstract

Information retrieval systems use various search techniques, such as B-trees, inverted files and suffix arrays, to provide quick responses. Many of these techniques rely on string comparison operations. If a record field is coded using Huffman codes (D.A. Huffman, 1952) in order to save storage space, the field must be decoded before any comparison can be performed. If, on the other hand, the field is alphabetically coded, the comparison can be applied directly to the sequence of codewords, which is faster. This approach still saves storage space compared with storing the field uncompressed. Experiments with alphabetically coded texts indexed with suffix arrays were reported by E.S. Moura et al. (1997). We consider the construction of L-restricted alphabetic binary prefix codes (ABPCs), that is, ABPCs whose codeword lengths satisfy l_i ≤ L for i = 1, ..., n. An optimal L-restricted ABPC can be constructed in O(nL log n) time using O(nL) space (L.L. Larmore and T.M. Przytycka, 1994). Nevertheless, due to its space requirements, this method becomes prohibitive for large values of n. We suggest a simple approach to construct suboptimal L-restricted ABPCs. Our approach is divided into three phases. In the first phase, we check whether an optimal ABPC is also an optimal L-restricted ABPC. In the second phase, we obtain an L-restricted prefix code (not necessarily alphabetic), and in the third phase we turn this code into an alphabetic one. We call this approach the 3-phase algorithm, and the codes it generates 3-phase codes. We analyze the time and space complexities of the algorithm and compare the average length of the 3-phase code against the Shannon entropy. We also compare the average length of the Huffman code against the average length of an optimal L-restricted ABPC.
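The properties the abstract relies on can be illustrated with a small sketch. It is not the 3-phase algorithm, whose details are not given here. It only checks, for a hand-made toy code, that the code is prefix-free, alphabetic and L-restricted, that comparing encoded strings gives the same order as comparing the original strings, and how the average codeword length relates to the Shannon entropy. The code table `CODE`, the probabilities `PROBS` and all function names are hypothetical and chosen for illustration only.

```python
import math
from itertools import product

# Hypothetical toy code over the symbols a < b < c < d (not from the paper).
CODE = {"a": "00", "b": "010", "c": "011", "d": "1"}
# Hypothetical symbol probabilities, chosen so the toy code matches them well.
PROBS = {"a": 0.25, "b": 0.125, "c": 0.125, "d": 0.5}

def is_prefix_free(code):
    """No codeword is a proper prefix of another codeword."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)

def is_alphabetic(code):
    """Codewords increase lexicographically in symbol order."""
    words = [code[s] for s in sorted(code)]
    return all(u < v for u, v in zip(words, words[1:]))

def is_l_restricted(code, L):
    """Every codeword length l_i satisfies l_i <= L."""
    return all(len(w) <= L for w in code.values())

def encode(text, code):
    """Concatenate the codewords of the symbols in text."""
    return "".join(code[ch] for ch in text)

def average_length(code, probs):
    """Expected codeword length under the given symbol probabilities."""
    return sum(probs[s] * len(code[s]) for s in code)

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def order_preserved(code, max_len=3):
    """Check that encoded comparison agrees with plain comparison
    for all strings of up to max_len symbols."""
    symbols = sorted(code)
    texts = ["".join(t) for n in range(max_len + 1)
             for t in product(symbols, repeat=n)]
    return all((x < y) == (encode(x, code) < encode(y, code))
               for x in texts for y in texts)

if __name__ == "__main__":
    print("prefix-free:", is_prefix_free(CODE))
    print("alphabetic:", is_alphabetic(CODE))
    print("3-restricted:", is_l_restricted(CODE, 3))
    print("order preserved:", order_preserved(CODE))
    print("average length:", average_length(CODE, PROBS))
    print("entropy:", entropy(PROBS))
```

Bit strings are represented as Python strings of "0" and "1" for readability; a real index would presumably compare packed bits, which this sketch does not attempt.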

Full Text