A Novel Center Point Initialization Technique for K-means Clustering Algorithm

Dauda Usman, Ismail Bin Mohamad

doi:10.5539/mas.v7n9p10

Abstract

Clustering is a major data analysis tool used in numerous domains. The basic K-means method has been widely discussed and applied in many applications, but it often fails to give good clustering results because the initial center points are chosen randomly. In this article, we present a new method of center point initialization and prove that the distance of the new method follows a Chi-square distribution. The new method overcomes the drawbacks of the basic K-means. Experimental analysis shows that the new method performs well on an infectious diseases dataset when compared with the basic K-means clustering method, and a histogram is used to assess the quality of the new method.

Highlights

The massive quantity of information gathered and stored in databases calls for efficient exploration techniques that can exploit the information they contain
We show that the new method is normal and follows a Chi-square distribution
We evaluate the accuracy of the two approaches, where accuracy is measured by the error sum of squares of the intra-cluster distances, that is, the distances between the data vectors in a cluster and the centroid of that cluster; the smaller this sum, the better the clustering accuracy
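
The error sum of squares mentioned in the last highlight can be illustrated with a minimal sketch. It assumes squared Euclidean distance and a fixed assignment of points to clusters. The function name `error_sum_of_squares` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def error_sum_of_squares(
    points: Sequence[Vector],
    labels: Sequence[int],
    centroids: Sequence[Vector],
) -> float:
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        # Squared distance between one data vector and the centroid of its cluster.
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Toy data (illustrative only): two clusters of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
labels = [0, 0, 1, 1]

# Centroids placed at the cluster means versus a poorly placed first centroid.
good_centroids = [(1.0, 2.0), (6.0, 5.0)]
poor_centroids = [(0.0, 0.0), (6.0, 5.0)]

sse_good = error_sum_of_squares(points, labels, good_centroids)
sse_poor = error_sum_of_squares(points, labels, poor_centroids)
print(sse_good, sse_poor)
```

Under this measure the centroids placed at the cluster means give the smaller sum, which is the sense in which a smaller error sum of squares indicates a better clustering.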

Summary

Introduction

The massive quantity of information gathered and stored in databases calls for efficient exploration techniques that can exploit the information they contain. Clustering is among the first data exploration tasks; it enables one to understand patterns and natural groupings within datasets. Improving clustering techniques continues to attract considerable interest. The aim is to cluster the items in a database into a set of meaningful subclasses (Ankerst et al., 1999). Exploration may be carried out on different databases and data repositories, and the kinds of patterns to be found are specified by different mining functionalities such as concept/class description, classification, association, prediction, correlation analysis, cluster analysis and so on.

Methods

Results

Conclusion