An intuitive based clustering on document mining using Dirichlet process mixture models and its kernels

D Ratnam, B V Subba Rao, J Rajendra Prasad, S Sai Kumar

doi:10.1109/scopes.2016.7955646

Abstract

Clustering is considered one of the most important techniques in machine learning and data mining. Clustering techniques group documents of the same kind together, and similarity measures are used to determine the relationships between transactions. Hierarchical clustering models generate tree-structured results, whereas partition-based clustering produces results in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) as input when grouping documents, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen. It is observed that document features are automatically partitioned into two groups: discriminative words and non-discriminative words. Only the discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to a poor clustering solution. A variational inference algorithm is used to infer the structure of the document collection and the partition of document words simultaneously. The Dirichlet Process Mixture (DPM) model is used to partition documents in a way that utilizes both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Extraction (DPMFE) is used to discover the latent cluster structure based on the DPM model, and it does so without requiring the number of clusters as input.
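
The clustering property of the DP referred to above is commonly illustrated with the Chinese restaurant process (CRP), in which each new document joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The following is a minimal sketch of that prior only, not the DPMFE algorithm or its variational inference; the function name `crp_assignments`, the parameter `alpha`, and the choice of 200 documents are assumptions made for illustration.

```python
import random


def crp_assignments(n_docs, alpha, seed=0):
    """Sample cluster labels for n_docs items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter; larger values tend to
    produce more clusters. No cluster count K is supplied.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of documents currently in cluster k
    labels = []  # labels[i] = cluster index assigned to document i
    for _ in range(n_docs):
        # Existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, counts


labels, counts = crp_assignments(n_docs=200, alpha=1.0)
print("clusters found:", len(counts))
print("cluster sizes:", counts)
```

The number of clusters here emerges from the sampling process rather than being fixed in advance. A full DPM model would additionally weight each choice by the likelihood of the document's words under each cluster, which this sketch omits.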

Full Text