Processing truncated terms in document retrieval systems

Paul Bratley, Yaacov Choueka

doi:10.1016/0306-4573(82)90004-8

Abstract

In a typical inverted-file full-text document retrieval system, the user submits queries consisting of strings of characters combined by various operators. The strings are looked up in a text-dictionary which lists, for each string, all the places in the database at which it occurs. It is desirable to allow the user to include in his query truncated terms such as X∗, ∗X, ∗X∗, or X∗Y, where X and Y are specified strings and ∗ is a variable-length don't-care character, that is, ∗ represents an arbitrary, possibly empty, string. Processing these terms involves finding the set of all words in the dictionary that match these patterns. How to do this efficiently is a long-standing open problem in this domain. In this paper we present a uniform and efficient approach for processing all such query terms. The approach, based on a “permuted dictionary” and a corresponding set of access routines, requires essentially one disk access to obtain from the dictionary all the strings represented by a truncated term, with negligible computing time. It is thus well suited for on-line applications. Implementation is simple, and storage overhead is low: it can be made almost negligible by using some specially adapted compression techniques described in the paper. The basic approach is easily adaptable for slight variants, such as fixed (or bounded) length don't-care characters, or more complex pattern matching templates.
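
The abstract does not spell out how the permuted dictionary is built, so the following is only a minimal in-memory sketch of one common rotation-based construction, not necessarily the authors' exact scheme. It assumes an end-of-word marker `$` that never occurs in a word, stores every rotation of `word + "$"` in sorted order, and turns each truncated term into a single prefix search. The names `build_permuted_dictionary`, `prefix_lookup`, and `match` are hypothetical, and the ASCII `*` stands in for the ∗ of the text.

```python
from bisect import bisect_left

END = "$"  # assumed end-of-word marker; must not occur in any dictionary word


def build_permuted_dictionary(words):
    """Return a sorted list of (rotation, word) for every rotation of word + END."""
    entries = []
    for word in set(words):
        marked = word + END
        for i in range(len(marked)):
            entries.append((marked[i:] + marked[:i], word))
    entries.sort()
    return entries


def prefix_lookup(entries, prefix):
    """Collect the words whose rotations start with prefix.

    Because entries is sorted, all such rotations form one contiguous run.
    """
    start = bisect_left(entries, (prefix,))
    found = set()
    for rotation, word in entries[start:]:
        if not rotation.startswith(prefix):
            break
        found.add(word)
    return found


def match(entries, pattern):
    """Resolve a truncated term of the form X*, *X, *X*, or X*Y."""
    parts = pattern.split("*")
    if len(parts) == 2:
        # X*  -> prefix "$X";  *X -> prefix "X$";  X*Y -> prefix "Y$X"
        x, y = parts
        return prefix_lookup(entries, y + END + x)
    if len(parts) == 3 and parts[0] == "" and parts[2] == "":
        # *X* -> any rotation beginning with X
        return prefix_lookup(entries, parts[1])
    raise ValueError("unsupported pattern form: " + pattern)


words = ["retrieval", "retrieve", "retrieving", "query", "inquiry", "queue"]
entries = build_permuted_dictionary(words)
print(sorted(match(entries, "retriev*")))
print(sorted(match(entries, "*ry")))
print(sorted(match(entries, "*qu*")))
print(sorted(match(entries, "q*y")))
```

In this sketch every supported pattern reduces to one binary search followed by a scan of adjacent entries. That contiguity is a plausible reading of how a disk-resident version could answer a truncated term with essentially one access, though the paper's own access routines and compression techniques are not reproduced here.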

Full Text