Two improved k-means algorithms

Shyr-Shen Yu, Shao-Wei Chu, Chuin-Mu Wang, Yung-Kuan Chan, Ting-Cheng Chang

doi:10.1016/j.asoc.2017.08.032

Abstract

The k-means algorithm is the most commonly used simple clustering method. For large volumes of high-dimensional numerical data, it provides an efficient way to group similar data into the same cluster. In this study, a tri-level k-means algorithm and a bi-layer k-means algorithm are proposed. The k-means algorithm is vulnerable to outliers and noisy data, and it is also sensitive to the choice of initial cluster centers. The tri-level k-means algorithm can overcome these drawbacks. When the data in a dataset S change frequently, the trained cluster centers can, after a period of time, no longer precisely describe the data in each cluster, so the cluster centers need to be updated. In this paper, a tri-level k-means algorithm based on online machine learning is also provided to solve this problem. When the data in a cluster differ significantly from one another, a single cluster center cannot precisely describe every datum in the cluster. Noisy data, outliers, and data with quite different values in the same cluster may degrade the performance of pattern matching systems. The bi-layer k-means algorithm can deal with these problems. In addition, a genetic-based algorithm is provided to derive the fittest parameters for the tri-level and bi-layer k-means algorithms. Experimental results demonstrate that both algorithms provide much better classification accuracy than the traditional k-means algorithm.
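For orientation, the outlier sensitivity of the traditional k-means algorithm mentioned in the abstract can be illustrated with a minimal sketch. This is only the plain baseline (Lloyd's iteration), not the tri-level or bi-layer algorithm proposed in the paper. The function name `kmeans`, its parameters, and the toy data are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centers, max_iter=100):
    """Plain (baseline) k-means starting from the given initial centers.

    Illustrative sketch only; this is not the method proposed in the paper.
    """
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters


# Hypothetical toy data: two compact groups of four points each.
clean = [(0, 0), (0, 1), (1, 0), (1, 1),
         (10, 10), (10, 11), (11, 10), (11, 11)]
init = [(0, 0), (10, 10)]

centers_clean, _ = kmeans(clean, init)
# -> [(0.5, 0.5), (10.5, 10.5)]: one center per group

centers_noisy, _ = kmeans(clean + [(100, 100)], init)
# -> [(5.5, 5.5), (100.0, 100.0)]: the outlier captures a center of its own
```

With a single outlier added, one center ends up describing only the outlier while the two real groups are merged under the other center, which is the kind of failure the abstract attributes to the traditional algorithm.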

Full Text