Efficient phrase-based document indexing for Web document clustering

K.M. Hammouda, M.S. Kamel

doi:10.1109/tkde.2004.58

Abstract

Most document clustering techniques rely on single-term analysis of the document data set, as in the vector space model. More accurate clustering requires more informative features, such as phrases and their weights. Document clustering is useful in many applications, including automatic categorization of documents, grouping of search engine results, and building document taxonomies. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. The second part is an incremental document clustering algorithm that maximizes the tightness of clusters by carefully monitoring the pair-wise document similarity distribution inside clusters. Together, these two components form an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
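To make the idea of incremental phrase matching concrete, here is a minimal Python sketch. It is not the authors' document index graph: the class name `PhraseIndex`, its methods, and the whitespace tokenization are all assumptions made for illustration. It keeps one entry per directed word pair (an "edge"), records where each edge occurs in earlier documents, and reports the maximal word sequences a newly added document shares with each earlier one.

```python
from collections import defaultdict


class PhraseIndex:
    """Toy incremental phrase index (illustrative only, not the paper's model)."""

    def __init__(self):
        # (word, next_word) -> {doc_id: [positions of `word` in that document]}
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, text):
        """Match the new document against the index, then add it to the index."""
        words = text.lower().split()  # assumption: plain whitespace tokenization
        shared = self._match(words)
        for i in range(len(words) - 1):
            self.edges[(words[i], words[i + 1])][doc_id].append(i)
        return shared

    def _match(self, words):
        """Return {earlier_doc_id: sorted maximal shared phrases of 2+ words}."""
        shared = defaultdict(set)
        # (doc_id, edge position in that doc) -> start index of the run in `words`
        active = {}
        for i in range(len(words) - 1):
            edge = (words[i], words[i + 1])
            next_active = {}
            for doc_id, positions in self.edges.get(edge, {}).items():
                for p in positions:
                    # Extend a run that ended on the preceding edge, else start one.
                    start = active.pop((doc_id, p - 1), i)
                    next_active[(doc_id, p)] = start
            # Runs that were not extended are maximal; they end at word i.
            for (doc_id, _), start in active.items():
                shared[doc_id].add(" ".join(words[start:i + 1]))
            active = next_active
        # Runs still open at the end of the document extend to its last word.
        for (doc_id, _), start in active.items():
            shared[doc_id].add(" ".join(words[start:]))
        return {doc_id: sorted(phrases) for doc_id, phrases in shared.items()}


index = PhraseIndex()
index.add_document("d1", "river rafting trips in wild river")
matches = index.add_document("d2", "wild river adventures and river rafting")
print(matches)  # {'d1': ['river rafting', 'wild river']}
```

Because matching happens as each document is added, the index never needs to be rebuilt. A similarity measure could then weight the shared phrases, for example by length or frequency, but that choice is left open here.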

Journal: IEEE Transactions on Knowledge and Data Engineering
Publication Date: Oct 1, 2004
Citations: 350

Similar Papers

A collaborative filtering-based approach to personalized document clustering
Chih-Ping Wei ... Han-Wei Hsiao
Decision Support Systems | VOL. 45
18 May 2007

Combining preference- and content-based approaches for improving document clustering effectiveness
Chih-Ping Wei ... Han-Wei Hsiao
Information Processing & Management | VOL. 42
24 Aug 2005

Web Document Clustering Using Document Index Graph
B F Momin ... Amol Chaudhari
01 Dec 2006

Phrase-based document similarity based on an index graph model
K.M Hammouda ... M.S Kamel
09 Dec 2002
