Text Document Clustering: The Application of Cluster Analysis to Textual Document

Venkata Srikanth Reddy, Patrick Kinnicutt, Roger Lee

doi:10.1109/csci.2016.0222

Abstract

Gathering the data most relevant to one's needs from the huge collection of data on the Internet is very difficult. To make it easier, we propose an application of text clustering, the automatic grouping of text documents into clusters so that documents within a cluster are similar to one another but dissimilar to documents in other clusters. Most existing text clustering algorithms use the traditional vector space model, which treats documents as collections of words and ignores the word sequences in them, even though the meaning of natural language strongly depends on those sequences. Our first objective is to implement in Java a clustering algorithm named Clustering based on Frequent Word Sequences. Frequent word sequences can provide compact and valuable information about text documents. Our second objective is to use an association rule miner [13] to find the frequent two-word sets that satisfy the minimum support, using the Apriori algorithm [2,5]. Our results will show that the final compact document representations are more accurate and precise than those produced by the regular method.
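As a rough illustration of the second objective, the following minimal Java sketch finds frequent two-word sets in the Apriori style: it first counts the support of single words, then counts only those pairs whose words are both frequent. It is not the authors' implementation. The class name `FrequentWordPairs`, the method `frequentPairs`, the tokenization on non-word characters, and the reading of minimum support as an absolute number of documents are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, named for this example only.
class FrequentWordPairs {

    // Maps each frequent two-word set (written "w1 w2", words in alphabetical
    // order) to the number of documents that contain both words.
    // Assumption: minSupport is an absolute document count, not a fraction.
    static Map<String, Integer> frequentPairs(List<String> documents, int minSupport) {
        // Represent each document as its set of distinct lowercase words.
        List<Set<String>> wordSets = new ArrayList<>();
        for (String doc : documents) {
            Set<String> words = new TreeSet<>();
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
            wordSets.add(words);
        }

        // Pass 1: support of each single word.
        Map<String, Integer> wordSupport = new TreeMap<>();
        for (Set<String> words : wordSets) {
            for (String w : words) {
                wordSupport.merge(w, 1, Integer::sum);
            }
        }

        // Pass 2: count only pairs built from frequent words, since a pair
        // cannot be frequent unless both of its words are (Apriori pruning).
        Map<String, Integer> pairSupport = new TreeMap<>();
        for (Set<String> words : wordSets) {
            List<String> frequent = new ArrayList<>();
            for (String w : words) {
                if (wordSupport.get(w) >= minSupport) {
                    frequent.add(w);
                }
            }
            for (int i = 0; i < frequent.size(); i++) {
                for (int j = i + 1; j < frequent.size(); j++) {
                    pairSupport.merge(frequent.get(i) + " " + frequent.get(j), 1, Integer::sum);
                }
            }
        }

        // Keep only the pairs that meet the minimum support.
        pairSupport.values().removeIf(count -> count < minSupport);
        return pairSupport;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "text clustering groups text documents",
            "document clustering uses text mining",
            "frequent word sequences describe documents");
        System.out.println(frequentPairs(docs, 2)); // prints {clustering text=2}
    }
}
```

With a minimum support of 2, only the pair "clustering text" survives in this toy corpus, because it is the only pair of words that co-occurs in at least two documents. The sketch treats word pairs as unordered sets. A frequent word sequence, by contrast, would also take word order into account.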

Full Text