Abstract
This paper deals with a problem that can arise when the aim is to cluster a document collection by textual feature frequency and there is substantial variation in document length. The first part of the discussion shows why such length variation can be a problem for frequency-based clustering. The second describes data normalizations to deal with the problem and shows that these are unreliable where documents are too short to provide accurate probability estimates for data variables. The third uses statistical sampling theory to develop a method for identifying and eliminating such documents from the analysis.

Introduction

Cluster analysis is used across many science and engineering disciplines to identify interesting structure in data; see, for example, Gan, Ma, & Wu (2007, ch. 18), Xu & Wunsch (2009, pp. 8–12), and the extensive references to cluster analysis applications on the Web. The advent of digital electronic natural language text has seen its application in disciplines such as information retrieval (Manning et al., 2008) and data mining (Feldman & Sanger, 2007) and, increasingly, in corpus-based linguistics (Moisl, 2009). In all these application areas, the reliability of cluster analytical results depends on the combination of the clustering algorithm being used and the characteristics of the data being analysed, where “reliability” is understood as the extent to which the result identifies structure that really is present in the domain from which the data was abstracted, given some well-defined sense of what it means for structure to be “really present”. This discussion focuses on how the reliability of cluster analysis can be compromised by one particular characteristic of data abstracted from natural language corpora, and on what to do about it.

That characteristic arises when the aim is to cluster a collection of length-varying documents based on the frequency of occurrence of one or more linguistic or textual features; recent examples are the clustering of the suras of the Qur'an on the basis of lexical frequency (Thabet, 2005) and of dialect speakers on the basis of phonetic segment frequency (Moisl et al., 2006). Because longer documents are likely to contain more examples of the features of interest than shorter ones, the frequencies of the data variables representing those features will be numerically larger for the longer documents than for the shorter ones, which in turn leads one to expect that the documents will cluster in accordance with relative length rather than with more interesting criteria latent in the data; this expectation has been empirically confirmed (for example Thabet, 2005).

The solution is to eliminate relative document length as a factor by adjusting the data frequencies using a length normalization method. This is not a panacea, however. One or more documents in the collection might be too short to provide accurate population probability estimates for the variables, and, because length normalization methods exacerbate such inaccuracies, analysis based on the normalized data will cluster the documents in question inaccurately. To deal with this problem, the present discussion proposes the definition of a minimum length threshold for acceptably accurate variable probability estimation, and the elimination from the analysis of any documents that fall below that threshold.
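To make the normalization step concrete, the following is a minimal Python sketch of the simplest such adjustment, division of each document's feature frequencies by its total feature count. It illustrates the general idea rather than any specific normalization method evaluated in the paper; the function name length_normalize and the toy matrix M are hypothetical.

    import numpy as np

    def length_normalize(freq_matrix):
        # freq_matrix: documents x features array of raw counts.
        # Dividing each row by its total count turns the counts into
        # per-document relative frequencies, so that overall document
        # length no longer dominates the distances used for clustering.
        lengths = freq_matrix.sum(axis=1, keepdims=True)
        return freq_matrix / lengths

    # Hypothetical toy data: three documents of very different lengths,
    # four textual features.
    M = np.array([[120., 30., 45., 15.],
                  [ 12.,  3.,  5.,  1.],
                  [ 60., 14., 22.,  8.]])
    print(length_normalize(M))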
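For the minimum length threshold, one standard sampling-theory calculation, shown here as an illustrative sketch and not necessarily the exact formulation developed in the paper, is the sample size required to estimate a population proportion p to within a chosen margin of error at a given confidence level; a document shorter than that size cannot be expected to yield an acceptably accurate probability estimate for the corresponding variable. The function name minimum_length and the example values are hypothetical.

    import math

    def minimum_length(p, margin, z=1.96):
        # Number of feature tokens needed to estimate a variable whose
        # population probability is roughly p to within +/- margin, at
        # the confidence level implied by z (1.96 is about 95%).
        return math.ceil(z * z * p * (1.0 - p) / (margin * margin))

    # Example: a feature occurring with probability about 0.05, estimated
    # to within +/- 0.01 at 95% confidence, requires at least
    print(minimum_length(0.05, 0.01))   # 1825 feature tokens

Documents falling below the computed threshold would then be excluded before normalization and clustering, as the paper proposes.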