Enhancing Document Clustering through Heuristics and Summary-Based Pre-processing

Sri Harsha Allamraju, Robert Chun

doi:10.1007/978-3-642-02559-4_12

Abstract

Knowledge workers are burdened with information overload. The information they need may be scattered across many places: buried in a file system, in their email, or on the web. Traditional clustering algorithms help in assimilating these diverse sources of information and in generating meaningful relationships among them. Typical clustering pre-processing involves tokenization, removal of stop words, stemming, pruning, etc. In this paper, we propose the use of the summary and heuristics of a document as a pre-processing technique. This technique preserves the formatting of a document and uses this information to produce better clusters. In addition, only a summary of a document is used as the basis for clustering instead of the whole document. Clustering algorithms using the proposed pre-processing technique on formatted documents produced improved and more meaningful clusters.

Keywords: document clustering, clustering, summarization, heuristics
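The conventional pre-processing steps named in the abstract (tokenization, stop-word removal, stemming, pruning) can be illustrated with a minimal sketch. This is not the summary- and heuristics-based technique proposed in the paper, only the typical baseline it refers to. The stop-word list, the `crude_stem` suffix stripper, and the `min_doc_freq` threshold are all illustrative assumptions; a real system would likely use a full stop-word list and an established stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "for", "on"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(documents, min_doc_freq=2):
    """Tokenize, remove stop words, stem, then prune rare terms."""
    tokenized = []
    for text in documents:
        # Tokenization: lowercase and keep alphabetic runs only.
        tokens = re.findall(r"[a-z]+", text.lower())
        # Stop-word removal followed by stemming.
        tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
        tokenized.append(tokens)

    # Pruning: drop terms that occur in fewer than min_doc_freq documents.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [[t for t in tokens if doc_freq[t] >= min_doc_freq] for tokens in tokenized]
```

For example, calling `preprocess` on the three short texts "Clustering the documents", "Clusters of emails", and "A document about the web" keeps only the stems shared by at least two texts, here `cluster` and `document`.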

Full Text