Indexing exhaustivity and the computation of similarity matrices

Alan F Harding, Peter Willett

doi:10.1002/asi.4630310411

Abstract

Some of the automatic classification procedures used in information retrieval derive clusters of documents from an intermediate similarity matrix, the computation of which involves comparing each of the documents in the collection with all of the others. It has recently been suggested that many of these comparisons, specifically those between documents having no terms in common, may be avoided by using an inverted file to the document collection. This communication shows that the approach will reduce the number of interdocument comparisons only if the documents are each indexed by a limited number of indexing terms; if exhaustive indexing is used, many document pairs will be compared several times over, and the computation will be greater than when conventional approaches are used to generate the similarity matrix.
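The effect described above can be illustrated with a small counting sketch. It is not the authors' procedure. It assumes a simplified model in which the conventional approach makes one comparison per document pair, while the inverted-file approach makes one comparison for every pair of documents that appears together in a term's posting list. The function names and the two toy collections are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_file(docs):
    """Map each term to the list of ids of the documents indexed by it."""
    postings = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings[term].append(doc_id)
    return postings


def conventional_comparisons(docs):
    """Every document is compared once with every other: N(N-1)/2."""
    n = len(docs)
    return n * (n - 1) // 2


def inverted_file_comparisons(docs):
    """Assumed model: one comparison per document pair per shared term."""
    postings = build_inverted_file(docs)
    return sum(len(p) * (len(p) - 1) // 2 for p in postings.values())


def distinct_pairs_with_common_term(docs):
    """Number of document pairs that share at least one term."""
    postings = build_inverted_file(docs)
    pairs = set()
    for p in postings.values():
        pairs.update(combinations(p, 2))
    return len(pairs)


# Sparse indexing: two terms per document, little overlap.
sparse = [{"a", "b"}, {"c", "d"}, {"e", "f"}, {"a", "g"}]

# Exhaustive indexing: every document carries the same five terms.
exhaustive = [{"a", "b", "c", "d", "e"} for _ in range(4)]

for name, docs in (("sparse", sparse), ("exhaustive", exhaustive)):
    print(
        name,
        conventional_comparisons(docs),
        inverted_file_comparisons(docs),
        distinct_pairs_with_common_term(docs),
    )
```

In the sparse collection only one pair shares a term, so the inverted file replaces six comparisons with one. In the exhaustive collection each of the six pairs is met once per shared term, giving thirty comparisons under this model against six for the conventional approach.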
