Abstract

The prospect of digital libraries presents the challenge of storing vast amounts of information efficiently, and in a way that facilitates rapid search and retrieval. Storage space can be reduced by appropriate compression techniques, and searching can be enabled by constructing a full-text index. But these two requirements are in tension: data compression saves space at the expense of added access time, while indexing methods provide fast access at the expense of considerable additional storage space. Indeed, a complete index to a large body of text can be larger than the text itself, since it may record the location of every word in the text.

This paper shows that it is possible to make compression and indexing work together efficiently. We describe compression methods suited to large amounts of text that allow both random-access decoding of individual documents and fast execution; we show how the index can itself be compressed, so that only a small amount of overhead space is required to store it; and we show how the application of compression techniques allows efficient construction of the index in the first instance.

The result is a system that can take a large body of text and convert it to a compressed text and index that together occupy less than half the space required by the original text alone. Combining the two techniques incurs little penalty in access speed; in fact, access speed can even improve, because less data must be read from slow secondary storage devices. Full-text queries, whether Boolean or ranked, can be answered in small fractions of a second; documents can be decoded at a rate of approximately one megabyte a second; and the initial indexing and compression can be carried out on a mid-range workstation at a rate of several hundred megabytes an hour.
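To make the first of these ideas concrete, here is a minimal sketch of per-document compression with an offset table, one simple way to support random-access decoding of individual documents. The abstract does not specify the paper's actual compression model, so the use of zlib and the CompressedCollection class below are illustrative assumptions only.

```python
# A minimal sketch (assumed for illustration; the paper's actual
# compression model is not given in this abstract) of per-document
# compression with an offset table.  Because every document is
# compressed independently, any single document can be decoded
# without touching the rest of the collection.
import zlib

class CompressedCollection:
    def __init__(self, documents):
        self.blob = bytearray()      # concatenated compressed documents
        self.offsets = []            # (start, length) of each document
        for text in documents:
            data = zlib.compress(text.encode("utf-8"))
            self.offsets.append((len(self.blob), len(data)))
            self.blob.extend(data)

    def get(self, doc_id):
        """Random-access decode of a single document by number."""
        start, length = self.offsets[doc_id]
        compressed = bytes(self.blob[start:start + length])
        return zlib.decompress(compressed).decode("utf-8")

docs = ["the first document", "the second document", "a third, longer document"]
collection = CompressedCollection(docs)
assert collection.get(1) == "the second document"
```

In a production system the offset table would itself be stored compactly and the codec would typically share a model across documents, but the offset table is what makes random access to individual documents possible without decompressing the whole collection.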
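Similarly, a standard way to compress the index itself is to store each term's posting list as gaps between successive document numbers, coded with a variable-length integer code. The sketch below uses Elias gamma codes over a bit string for clarity; this is an assumed scheme for illustration, not necessarily the coding method the paper evaluates.

```python
# Sketch of inverted-index compression via gap encoding (an assumed,
# illustrative scheme): a sorted posting list of document numbers is
# stored as the gaps between successive entries, each written with an
# Elias gamma code.  Small gaps, which dominate for common terms,
# then take very few bits.

def gamma_encode(n: int) -> str:
    """Elias gamma code for n >= 1: a unary prefix giving the binary
    length of n, followed by the binary digits of n."""
    binary = bin(n)[2:]                        # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary    # e.g. 9 -> '0001001'

def compress_postings(doc_numbers: list[int]) -> str:
    """Encode a sorted posting list as gamma-coded gaps."""
    bits, previous = [], 0
    for d in doc_numbers:
        bits.append(gamma_encode(d - previous))   # store the gap, not d
        previous = d
    return "".join(bits)

def decompress_postings(bits: str) -> list[int]:
    """Invert compress_postings: read gamma codes, accumulate gaps."""
    docs, pos, current = [], 0, 0
    while pos < len(bits):
        length = 0
        while bits[pos] == "0":     # unary prefix: code length minus one
            length += 1
            pos += 1
        gap = int(bits[pos:pos + length + 1], 2)
        pos += length + 1
        current += gap
        docs.append(current)
    return docs

postings = [3, 5, 20, 21, 23, 76]
encoded = compress_postings(postings)
assert decompress_postings(encoded) == postings
print(f"{len(postings)} postings in {len(encoded)} bits "
      f"(vs. {len(postings) * 32} bits uncompressed)")
```

Gaps are smallest exactly for the terms that appear in many documents, so the longest posting lists compress best; this is one reason a full-text index can be kept to a small fraction of the size of the text it indexes.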
