In situ generation of compressed inverted files

Alistair Moffat, Timothy A H Bell

doi:10.1002/(sici)1097-4571(199508)46:7<537::aid-asi7>3.0.co;2-p

Abstract

An inverted index stores, for each term that appears in a collection of documents, a list of document numbers containing that term. Such an index is indispensable when Boolean or informal ranked queries are to be answered. Construction of the index is, however, a nontrivial task. Simple methods using in-memory data structures cannot be used for large collections because they require too much random access storage, and traditional disk-based methods require large amounts of temporary file space. This paper describes a new indexing algorithm designed to create large compressed inverted indexes in situ. It makes use of simple compression codes for the positive integers and an in-place external multi-way mergesort. The new technique has been used to invert a two-gigabyte text collection in under 4 hours, using less than 40 megabytes of temporary disk space, and less than 20 megabytes of main memory. © 1995 John Wiley & Sons, Inc.
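The two ingredients named above, an inverted index and a simple code for the positive integers, can be illustrated with a small in-memory sketch. It is not the in situ algorithm of the paper: the function names are hypothetical, the Elias gamma code is assumed here as one example of a simple integer code (the abstract does not say which codes are used), and bit strings stand in for packed bits. Each inverted list is stored as the gaps between successive document numbers, which are small positive integers and therefore code compactly.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to an ascending list of 1-based document numbers."""
    index = defaultdict(list)
    for doc_num, text in enumerate(docs, start=1):
        # set() ensures each document is listed at most once per term.
        for term in set(text.lower().split()):
            index[term].append(doc_num)
    return dict(index)


def to_gaps(doc_nums):
    """Replace ascending document numbers by differences (all >= 1)."""
    gaps, prev = [], 0
    for d in doc_nums:
        gaps.append(d - prev)
        prev = d
    return gaps


def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma code is defined for positive integers only")
    binary = bin(n)[2:]
    # A unary length prefix of zeros, followed by the binary value.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def compress_list(doc_nums):
    """Gap-encode an inverted list, then gamma-code each gap."""
    return "".join(gamma_encode(g) for g in to_gaps(doc_nums))


def decompress_list(bits):
    """Invert compress_list: decode the gaps and take running sums."""
    doc_nums, total = [], 0
    for g in gamma_decode_all(bits):
        total += g
        doc_nums.append(total)
    return doc_nums
```

For example, the list of document numbers 3, 5, 20 becomes the gaps 3, 2, 15, which the gamma code represents in 13 bits. A practical implementation would pack those bits into bytes and, for a large collection, would have to build the lists without holding the whole index in memory, which is the problem the paper addresses.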
