Abstract

The question of how to index documents is a central problem in document retrieval. The indexing problem can be stated as follows. There exists a large document collection, together with a population of retrieval system customers, each of whom wants information that he thinks might be supplied by documents in the collection. How should the documents in the collection be identified (“indexed,” “cataloged,” etc.) so that the collection can be searched to the maximal collective benefit of the customers?

The problem under investigation is that of developing a set of formal statistical rules for selecting the keywords of a document, the words likely to be useful as index terms for that document. A number of simple weighting techniques have been suggested for selecting the keywords of a document. These are: (i) frequency of occurrence in a document, (ii) frequency/document length, (iii) frequency/frequency in document collection, and (iv) frequency/(document length × frequency in collection). These have been examined in detail by Sparck Jones, [Sp73]. The major result of her experiments is that there is no best technique, except that (i) is consistently outperformed by the others. Her experiments also show that automatic indexing sometimes, but not always, outperforms manual controlled indexing. This has led to more sophisticated procedures for selecting keywords.

The first technique for keyword recognition was developed by Salton, [Sa75], and is known as the discrimination value model. The technique measures the effectiveness of a term by examining what happens if that term is removed from the index. The assumption is made that if all the documents seem more similar to one another after a term has been removed from the index, then that term has a descriptive power whose magnitude is represented by the change in total similarities. Salton has found significant retrieval improvement by using the discrimination value model to select the index terms for certain collections of documents.

A second, more sophisticated technique has been developed by Harter, [Ha75]. It is based upon the distribution characteristics of terms throughout the document collection. Harter's technique rests on the hypothesis that, when composing a document, authors choose terms other than those directly related to the subject under discussion randomly from a fixed vocabulary. If this is in fact the case, then the distribution characteristics of the non-descriptive terms should be described by a Poisson distribution.

It has been further hypothesized that descriptive terms are chosen by authors randomly in relation to a particular topic. If this is the case, the distribution of these terms within documents dealing with the topic in question should also be describable by the Poisson function f(k) = exp(−λ)·λ^k / k!, which gives the probability f(k) that a document contains k occurrences of a particular term, λ being the mean number of occurrences of the term per document in the collection, where the term is randomly distributed. This gives rise to the 2-Poisson model, [Bo75], which states that the distribution of a term within a document collection should be describable by two Poisson distributions, one of which describes the usage of the term as a “background” term and the other its usage as a keyword.
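To make the two distributions concrete, the following minimal Python sketch evaluates the single-Poisson probability and the 2-Poisson mixture probability for a given term-frequency count k. The function names and example parameters are ours, not taken from any of the cited papers.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a document contains exactly k occurrences of a term
    whose per-document occurrences follow a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

def two_poisson_pmf(k: int, p: float, lam1: float, lam2: float) -> float:
    """2-Poisson mixture: with probability p the document belongs to the
    'keyword' class (mean lam1), otherwise to the 'background' class (mean lam2)."""
    return p * poisson_pmf(k, lam1) + (1 - p) * poisson_pmf(k, lam2)

# Hypothetical example: a term averaging 4 occurrences in documents where it is
# a keyword and 0.5 occurrences elsewhere, with 20% of documents in the keyword class.
print(two_poisson_pmf(3, p=0.2, lam1=4.0, lam2=0.5))
```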
Thus the overall model is a combination of two Poisson functions and takes the form f(k) = p·exp(−λ1)·λ1^k/k! + (1−p)·exp(−λ2)·λ2^k/k!, where λ1 and λ2 represent the mean number of occurrences of the term in each of the two classes and p is the proportion of documents in which the term is a keyword. Bookstein and Swanson, [Bo74], found that the 2-Poisson model did not successfully describe the distributions of all keywords, since the complete validity of the model rests on the rather naive assumption that there are exactly two ways in which a term is used. Harter, [Ha75], suggests (λ1 − λ2)/sqrt(λ1 + λ2) as an effective measure of the usefulness of an index term.

In his probabilistic approach to keyword selection, Harter [Ha75] used the less efficient moment estimators for estimating the parameters of mixtures of discrete distributions. Harter emphasized that the method of maximum likelihood provides iterative rather than exact solutions for a mixture of two distributions, and that the solutions are, in general, very slow to converge. Contending that the method of moment estimators would have been acceptable for estimating the parameters of the 2-Poisson distribution back in the 1930s, when computers were unavailable to statisticians, Olagunju, [Ol80], has investigated the properties of the 2-Poisson model.

In this presentation we show how a combination of the method of moments and the method of maximum likelihood can be used for estimating the parameters of the 2-Poisson distribution. The likelihood function for the 2-Poisson model is given by L = PRODUCT[f(x_j | p, λ1, λ2), j = 1 to N], where x_j is the number of occurrences of the term in document j and N is the number of documents in the collection; its logarithm is log L = SUM[N_i·log(p·exp(−λ1)·λ1^i/i! + (1−p)·exp(−λ2)·λ2^i/i!), i = 0 to ∞], where N_i is the number of documents containing exactly i occurrences of the term. The log-likelihood log L is used to estimate the parameters p, λ1 and λ2, since its maximum is easier to locate than that of the likelihood itself. In fact, by Taylor series expansion, the point where the likelihood is a maximum is a solution of a system of three equations. The logarithm of the likelihood function for the Degenerate 2-Poisson model is given by log L = N_0·log[p·exp(−λ1) + (1−p)] + SUM[N_i·log(p·exp(−λ1)·λ1^i/i!), i = 1 to ∞]. In Olagunju's thesis, [Ol80], the combination of the 2-Poisson model and the Degenerate 2-Poisson model is examined in detail as a model of keyword distribution, and formulae expressing the parameters of the models in terms of empirical frequency statistics are derived. Finally, a measure, consistent with the 2-Poisson and the Degenerate 2-Poisson models, intended to identify keywords is proposed.
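As a rough illustration of fitting the model, the sketch below numerically maximizes the 2-Poisson log-likelihood for one term's per-document frequency counts and then evaluates Harter's usefulness measure. This is not the moment/maximum-likelihood procedure from Olagunju's thesis; scipy's general-purpose optimizer stands in for it, crude initial guesses replace the moment estimates, and all names and sample data are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def two_poisson_log_likelihood(params, counts):
    """Log-likelihood of the 2-Poisson mixture for an array of per-document
    term-frequency counts; params = (p, lam1, lam2)."""
    p, lam1, lam2 = params
    mix = p * poisson.pmf(counts, lam1) + (1 - p) * poisson.pmf(counts, lam2)
    return np.sum(np.log(mix + 1e-300))  # guard against log(0)

def fit_two_poisson(counts):
    """Estimate (p, lam1, lam2) by numerically maximizing the log-likelihood,
    starting from crude initial guesses in place of moment estimates."""
    counts = np.asarray(counts, dtype=float)
    x0 = (0.5, counts.mean() * 1.5 + 1e-6, counts.mean() * 0.5 + 1e-6)
    result = minimize(
        lambda q: -two_poisson_log_likelihood(q, counts),
        x0,
        bounds=[(1e-6, 1 - 1e-6), (1e-6, None), (1e-6, None)],
        method="L-BFGS-B",
    )
    p, lam1, lam2 = result.x
    if lam1 < lam2:  # label the class with the larger mean as the keyword class
        p, lam1, lam2 = 1 - p, lam2, lam1
    return p, lam1, lam2

def harter_z(lam1, lam2):
    """Harter's proposed measure of index-term usefulness."""
    return (lam1 - lam2) / np.sqrt(lam1 + lam2)

# Hypothetical term-frequency counts of one term across a small document collection.
counts = [0, 0, 1, 0, 3, 5, 0, 4, 0, 1, 6, 0]
p_hat, lam1_hat, lam2_hat = fit_two_poisson(counts)
print(p_hat, lam1_hat, lam2_hat, harter_z(lam1_hat, lam2_hat))
```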
