Abstract
The K-means clustering algorithm is well known for its computational simplicity. In this algorithm, essential cluster-level information is captured by the K cluster centroids. However, how well these centroids reveal the structure of the underlying data depends on the choice of K. In this paper, we propose a clustering algorithm in which the number of clusters K is learned while the clustering itself is performed. Our work revolves around two observations: i) a large random sample of a dataset may have a distribution similar to that of the original data, and ii) for the true number of clusters, the centroids generated from a sampled dataset approximate the cluster centroids generated from the original dataset. The first observation paves the way to a scalable solution, and the second forms the key building block of the proposed algorithm. We have tested our method on several real and synthetic datasets. Our method addresses several pertinent issues in clustering: 1) detection of a single cluster in the absence of any other cluster in the dataset, 2) the presence of hierarchy, 3) clustering of high-dimensional datasets, 4) robustness to cluster imbalance, and 5) robustness to noise. We have observed significant improvements in speed and quality, both for predicting the number of clusters and for determining the composition of clusters in a large dataset.
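To make the two observations concrete, here is a minimal sketch (not the authors' exact procedure, whose details appear in the body of the paper): for each candidate K, fit K-means on a random sample and on the full dataset, pair the two sets of centroids, and pick the K that minimizes the centroid discrepancy. The sample size, candidate range, and Hungarian-matching discrepancy measure are all illustrative assumptions; the paper's method presumably works from samples alone for scalability.

```python
# Sketch of observations (i) and (ii): centroids fitted on a random sample
# should, at the true K, closely match centroids fitted on the full data.
# All parameter choices below are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def centroid_discrepancy(c_full, c_sample):
    # Pair centroids via the Hungarian algorithm, then average the
    # distances between matched pairs.
    cost = np.linalg.norm(c_full[:, None, :] - c_sample[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=1_000, replace=False)]  # observation (i)

scores = {}
for k in range(2, 10):
    c_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
    c_samp = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample).cluster_centers_
    scores[k] = centroid_discrepancy(c_full, c_samp)  # observation (ii)

best_k = min(scores, key=scores.get)  # K with the best sample/full agreement
print(best_k, scores)
```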