Text document clustering based on frequent word sequences

Yanjun Li,Soon M Chung

doi:10.1145/1099554.1099633

Abstract

In this paper, we propose a new text clustering algorithm, named Clustering based on Frequent Word Sequences (CFWS). A word sequence is frequent if it occurs in more than certain percentage of the documents in the text database. In the past, the vector space model was commonly used for information retrieval, but it treats documents as bags of words, ignoring the sequential pattern of word occurrences in the documents. However, the meaning of natural languages strongly depends on the word sequences, and the frequent word sequences can provide compact and valuable information about the text database. Bisecting k-means and FIHC algorithms are evaluated on the performance of text clustering, and are compared with the proposed CFWS algorithm. It has been shown that CFWS has much better performance.

Full Text