Abstract

Signature files seem to be a promising access method for text and attributes. According to this method, the documents (or records) are stored sequentially in one file (the "text file"), while abstractions of the documents ("signatures") are stored sequentially in another file (the "signature file"). To resolve a query, the signature file is scanned first, and many nonqualifying documents are rejected immediately. We develop a framework that includes primary key hashing, multiattribute hashing, and signature files. Our goal is to find the optimal signature extraction method. The main contribution of this paper is that we present optimal algorithms, as well as efficient suboptimal ones, for assigning words to signatures in several environments. Another contribution is that we use information theory to study the relationship between the false drop probability F_d and the information that is lost during signature extraction. We give tight lower bounds on the achievable F_d and show that a simple relationship holds between the two quantities in the case of optimal signature extraction with uniform occurrence and query frequencies. We also examine hashing as a method to map words to signatures (instead of the optimal assignment), and show that the same relationship holds between F_d and the information loss, indicating that an invariant may exist between these two quantities for every signature extraction method.
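
The two-stage query resolution described above (scan the signatures first, then check the surviving documents) can be illustrated with a minimal sketch. It assumes a generic superimposed-coding scheme in which each word is hashed to a few bit positions. This is not the optimal word-to-signature assignment the paper develops, and the names `SIG_BITS`, `BITS_PER_WORD`, `word_signature`, `doc_signature`, and `search` are hypothetical, chosen only for illustration.

```python
import hashlib

SIG_BITS = 64       # signature width in bits (illustrative assumption)
BITS_PER_WORD = 3   # bit positions set per word (illustrative assumption)


def word_signature(word: str) -> int:
    """Hash a word to a bit pattern with at most BITS_PER_WORD bits set."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        pos = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << pos
    return sig


def doc_signature(text: str) -> int:
    """Superimpose (bitwise OR) the signatures of all words in a document."""
    sig = 0
    for w in text.split():
        sig |= word_signature(w)
    return sig


def search(word: str, docs: list[str], sigs: list[int]) -> tuple[list[int], int]:
    """Return (indices of qualifying documents, number of false drops)."""
    q = word_signature(word)
    # Stage 1: scan the signature file; reject documents missing any query bit.
    candidates = [i for i, s in enumerate(sigs) if s & q == q]
    # Stage 2: read the text of the candidates to discard false drops.
    hits = [i for i in candidates if word.lower() in docs[i].lower().split()]
    return hits, len(candidates) - len(hits)
```

A document that contains the query word always passes stage 1, because its signature includes every bit of that word's signature. A document that passes stage 1 without containing the word is a false drop, which is why stage 2 still consults the text file.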
