Abstract

The K-means clustering algorithm is well known for its computational simplicity. In this algorithm, essential cluster-level information is captured by the K cluster centroids. However, how well those centroids reveal the structure of the underlying data depends on the choice of K. In this paper, we propose a clustering algorithm that learns the number of clusters K while it performs the clustering. Our work rests on two observations: i) a sufficiently large random sample of a dataset may have a distribution similar to that of the original data, and ii) at the true number of clusters, the centroids generated from a sampled dataset approximate the cluster centroids generated from the original dataset. The first observation paves the way for a scalable solution, and the second forms the key building block of the proposed algorithm. We have tested our method on several real and synthetic datasets. Our method addresses several pertinent issues in clustering: 1) detection of a single cluster in the absence of any other cluster in a dataset, 2) the presence of hierarchy, 3) clustering of high-dimensional datasets, 4) robustness to cluster imbalance, and 5) robustness to noise. We have observed significant improvements in speed and quality, both for predicting the number of clusters and for determining the composition of clusters in large datasets.
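The second observation suggests a simple diagnostic, sketched below in Python. This is a minimal illustration of the sampling idea under stated assumptions, not the authors' published algorithm: for each candidate K, K-means centroids fitted on a random sample are matched to centroids fitted on the full dataset, and the mean matched distance is reported; at the true K this discrepancy should be small. The sample fraction (0.1), the candidate range, and the helper name centroid_discrepancy are illustrative choices, not from the paper.

    # Sketch of the sampling observation: compare K-means centroids fitted
    # on a random sample with those fitted on the full dataset, for each
    # candidate K. A small mean matched distance suggests the candidate K
    # is close to the true number of clusters.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def centroid_discrepancy(X, k, sample_frac=0.1, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=False)
        c_full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).cluster_centers_
        c_samp = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx]).cluster_centers_
        # K-means cluster labels are arbitrary, so optimally match sample
        # centroids to full-data centroids before measuring the gap.
        cost = cdist(c_full, c_samp)
        rows, cols = linear_sum_assignment(cost)
        return cost[rows, cols].mean()

    # Synthetic data with 4 well-separated clusters; the discrepancy
    # should be smallest near k = 4.
    X, _ = make_blobs(n_samples=5000, centers=4, random_state=42)
    for k in range(1, 9):
        print(k, round(centroid_discrepancy(X, k), 4))

Matching centroids with the Hungarian algorithm before comparison is essential here, since the ordering of K-means centroids carries no meaning.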
