Abstract

A method for improving centroid-based clustering is proposed. The improvement is built on diversifying the k-means++ initialization. The k-means++ algorithm, claimed to be a better version of k-means, is tested in a computational setup in which the dataset size, the number of features, and the number of clusters are varied. The statistics gathered from this testing show that, in roughly 50 % of the clustering instances, k-means++ outputs worse results than k-means with random centroid initialization. The advantage of random centroid initialization grows as both the dataset size and the number of features increase. To reduce the possible underperformance of k-means++, the k-means algorithm is run on a separate processor core in parallel with the k-means++ algorithm, after which the better of the two results is selected. The number of k-means++ runs is set to be no fewer than the number of k-means runs. By incorporating the seeding method of random centroid initialization, the k-means++ algorithm gains about 0.05 % in accuracy in every second clustering instance.
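The selection scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it runs k-means from both seedings sequentially (the paper assigns them to separate processor cores) and keeps whichever attains the lower within-cluster sum of squares (inertia). The dataset, cluster count, and helper names are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # k-means++ seeding: first centroid uniformly at random, each further
    # centroid with probability proportional to the squared distance to the
    # nearest centroid chosen so far
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(centroids)) ** 2).sum(-1), axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def random_init(X, k, rng):
    # random seeding: k distinct data points chosen uniformly
    return X[rng.choice(len(X), size=k, replace=False)]

def kmeans(X, centroids, iters=100):
    # standard Lloyd iterations from the given initial centroids
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) + rng.integers(0, 5, size=(300, 1))  # toy data

# Run both seedings and select the result with the lower inertia
res_pp = kmeans(X, kmeans_pp_init(X, 5, rng))
res_rand = kmeans(X, random_init(X, 5, rng))
best = min(res_pp, res_rand, key=lambda r: r[2])
```

In a faithful reproduction of the method, the two `kmeans` calls would be dispatched to separate cores (e.g. via `multiprocessing`), and k-means++ would be restarted at least as many times as randomly seeded k-means before the comparison.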
