Improving the Performance of Multivariate Bernoulli Model based Documents Clustering Algorithms using Transformation Techniques

Pitchandi Pitchandi

doi:10.3844/jcssp.2011.762.769

Abstract

Problem statement: Document clustering is one of the most important areas of data mining and is currently the subject of significant global research, since it strengthens web intelligence, web mining, web search engine design and so forth. Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Approach: This study explores the suitability of a multivariate Bernoulli model based probabilistic algorithm for text clustering. In a multivariate Bernoulli model, a document is represented as a binary vector over the space of words, with 0 and 1 indicating whether a word occurs in the document or not. The number of occurrences is not considered, so the word frequency information is lost. In this work, we propose an FFT based transformation technique for improving the clustering performance of the multivariate Bernoulli model based probabilistic algorithm. The transformation technique converts the actual term frequency count data into a time domain signal, so that the weight of each word's frequency is distributed throughout each row of records. If the multivariate Bernoulli model is then applied to the values less than zero and greater than zero, performance increases, since there is no information loss in this kind of data representation. Results: Bernoulli model based clustering and an improved version of it are implemented and evaluated using suitable metrics, and the results are shown. Conclusion: The transformation technique improves the performance of multivariate Bernoulli model based document clustering significantly.
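The transformation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact transform, so several choices here are assumptions rather than the author's implementation: a naive discrete Fourier transform (which yields the same values as an FFT, only more slowly) is applied to each row of term counts, only the real part is kept, and the result is binarized by sign before being passed to a Bernoulli model. The function names `dft_real` and `sign_binarize` are hypothetical.

```python
import cmath
from typing import List


def dft_real(counts: List[float]) -> List[float]:
    """Real part of the discrete Fourier transform of one row of term counts.

    Assumption: a naive O(n^2) DFT stands in for the FFT named in the paper,
    and keeping only the real part is an illustrative choice.
    """
    n = len(counts)
    transformed = []
    for k in range(n):
        total = sum(
            counts[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        transformed.append(total.real)
    return transformed


def sign_binarize(values: List[float], eps: float = 1e-9) -> List[int]:
    """Map values greater than zero to 1 and all other values to 0.

    Assumption: eps guards against floating point noise around zero.
    """
    return [1 if v > eps else 0 for v in values]


# Hypothetical term frequency row for one document over a 4-word vocabulary.
row = [1, 4, 0, 0]
signal = dft_real(row)
bits = sign_binarize(signal)
print(bits)
```

Because every transformed value depends on all counts in the row, the binary pattern changes when the frequencies change, which a plain presence/absence encoding of the same row would not capture.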

Highlights

Clustering: Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects
This study explores the suitability of a multivariate Bernoulli model based probabilistic algorithm for text clustering
In a multivariate Bernoulli model, a document is represented as a binary vector over the space of words, with 0 and 1 indicating whether a word occurs in the document or not
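The binary document representation mentioned above can be sketched as follows. This is a minimal illustration; the vocabulary, the sample document and the function name `binary_vector` are assumptions made for the example and do not come from the paper.

```python
from typing import List


def binary_vector(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Return 1 for each vocabulary word that occurs in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary and document.
vocabulary = ["cluster", "model", "text", "web"]
document = "text cluster text text model".split()

print(binary_vector(document, vocabulary))  # [1, 1, 1, 0]
```

Note that "text" occurs three times in the sample document but still maps to a single 1, which is the loss of word frequency information that the study sets out to address.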

Summary

Introduction

Clustering: Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. Cluster analysis is an important human activity in which we indulge since childhood, when we learn to distinguish between animals and plants. Clustering is a very important application area, but it is widely interdisciplinary in nature, which makes it very difficult to define its scope. It is used in several research communities to describe methods for grouping unlabeled data; these communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used (Velmurugan et al., 2010; Jain et al., 1999). Cluster analysis has been studied extensively for years, focusing mainly on distance-based cluster analysis. Many clustering tools have been built on k-means and k-medoids, and some of these methods have been incorporated into many statistical analysis software packages (Han et al., 2011).

Methods

Results

Conclusion