An Analytical Assessment on Document Clustering

Pushplata, Ram Chatterjee

doi:10.5815/ijcnis.2012.05.08

Abstract

Clustering is a data mining technique used in information retrieval. Clustering documents allows relevant information to be retrieved quickly: it organizes the documents into groups, each containing documents with similar content. Document clustering is an unsupervised data mining approach. Different algorithms are used to cluster documents, such as partitional clustering (K-means) and hierarchical clustering (Agglomerative Hierarchical Clustering, AHC). This paper presents an analysis of the Suffix Tree Clustering (STC) algorithm and of other clustering techniques (K-means, AHC) covered in the literature survey, and compares these algorithms. It also discusses the traditional Vector Space Model (VSM) and its similarity measures, which are used for clustering documents. According to the papers studied in the literature survey, the STC algorithm improves search performance compared with the other clustering algorithms. The paper presents the STC algorithm applied to search result documents stored in a dataset. It articulates the key requirements for web document clustering, with clusters created from the full text of the web documents. STC forms clusters based on phrases shared between documents, and it is a comparatively fast algorithm for document clustering.
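The idea of forming clusters from phrases shared between documents can be illustrated with a minimal sketch. It is not the STC implementation analysed in this paper: as a stated simplification, it indexes word n-grams in a dictionary rather than building a suffix tree, and it stops at candidate groups without scoring or merging them. The function name `shared_phrase_clusters`, its parameters `max_len` and `min_docs`, and the three sample documents are assumptions made for illustration only.

```python
from collections import defaultdict


def shared_phrase_clusters(docs, max_len=3, min_docs=2):
    """Map each shared word phrase to the set of document ids containing it.

    Simplification (assumption): phrases are word n-grams of 1..max_len
    words stored in a dictionary; a real STC implementation would use a
    suffix tree to find the shared phrases.
    """
    phrase_to_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                phrase_to_docs[phrase].add(doc_id)
    # Keep only phrases that occur in at least min_docs documents.
    return {
        phrase: ids
        for phrase, ids in phrase_to_docs.items()
        if len(ids) >= min_docs
    }


# Hypothetical sample documents, used only to exercise the sketch.
docs = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]
clusters = shared_phrase_clusters(docs)
for phrase, ids in sorted(clusters.items()):
    print(phrase, sorted(ids))
```

Each surviving phrase, together with its document set, plays the role of a candidate group: documents 0 and 1 are grouped by the phrase "ate cheese", for example, while a phrase that occurs in a single document produces no group.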
