Canonical Huffman code based full-text index

Yi Zhang, Zhili Pei, Jinhui Yang, Yanchun Liang

doi:10.1016/j.pnsc.2007.11.001

Open Access

https://doi.org/10.1016/j.pnsc.2007.11.001

Abstract

Full-text indices are data structures that can be used to find any substring of a given string. Many full-text indices require more space than the original string. In this paper, we introduce the canonical Huffman code to the wavelet tree of a string $T[1 \ldots n]$. Compared with the Huffman code based wavelet tree, the memory used to represent the shape of the wavelet tree is no longer needed. For a large alphabet, this part of the memory is not negligible. The operations on the wavelet tree are also simpler and more efficient thanks to the canonical Huffman code. Based on the resulting structure, the multi-key rank and select functions can be performed using at most $nH_0 + |\Sigma|(\lg\lg n + \lg n - \lg|\Sigma|) + O(nH_0)$ bits and in $O(H_0)$ time in the average case, where $H_0$ is the zeroth-order empirical entropy of $T$. Finally, we present an efficient construction algorithm for this index, which is on-line and linear.
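To illustrate why a canonical Huffman code removes the need to store the tree shape, the following is a minimal sketch, not the paper's construction algorithm. It computes Huffman code lengths for a string, then assigns canonical codewords from the lengths alone, and computes the zeroth-order empirical entropy for comparison. The function names (`huffman_code_lengths`, `canonical_codes`, `zeroth_order_entropy`) and the tie-breaking rule (shorter codes first, then symbol order) are assumptions made for this illustration.

```python
import heapq
from collections import Counter
from math import log2


def huffman_code_lengths(text):
    """Return {symbol: codeword length} for a Huffman code of `text`."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next_id, syms1 + syms2))
        next_id += 1
    return lengths


def canonical_codes(lengths):
    """Assign canonical codewords using only the code lengths.

    Symbols are ordered by (length, symbol); each codeword is the previous
    one plus one, shifted left whenever the length increases.
    """
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def zeroth_order_entropy(text):
    """Zeroth-order empirical entropy H_0 of `text`, in bits per symbol."""
    n = len(text)
    return -sum(c / n * log2(c / n) for c in Counter(text).values())


if __name__ == "__main__":
    T = "abracadabra"
    lengths = huffman_code_lengths(T)
    codes = canonical_codes(lengths)
    total_bits = sum(len(codes[ch]) for ch in T)
    print(codes)
    print(total_bits, len(T) * zeroth_order_entropy(T))
```

Because the codewords are fully determined by the per-symbol lengths, a decoder or a wavelet tree built over these codes can be reconstructed from the length table alone, which is the space saving the abstract refers to for large alphabets.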

Full Text