The suitable distance function for fuzzy C-Means clustering

Joko Eliyanto, Sugiyarto Surono

doi:10.1063/5.0106185

Abstract

Fuzzy C-Means clustering is a distance-based clustering method that applies the concept of fuzzy logic. Clustering proceeds iteratively, with each iteration working to minimize an objective function. This objective function is the sum, over data points and clusters, of the distance between a data point and a cluster centroid multiplied by the degree to which the point belongs to that cluster. From the form of the objective function, its value should decrease as the number of iterations increases. This research shows how to choose a suitable distance for Fuzzy C-Means clustering. A suitable distance satisfies the optimization problem of the Fuzzy C-Means clustering method and produces good cluster quality. The distances compared are the Euclidean, Average, Manhattan, Chebyshev, Minkowski, Minkowski-Chebyshev, and Canberra distances. We use five UCI Machine Learning datasets and two random datasets, and we use the Lagrange multiplier method to solve the optimization problem. Cluster quality is measured by accuracy, the Davies-Bouldin index, purity, and the adjusted Rand index. The experiments show that the Canberra distance is the best distance, giving the optimum result with a minimum objective function value of 378.185. The distances suitable for the Fuzzy C-Means clustering method are the Euclidean, Average, Manhattan, Minkowski, Minkowski-Chebyshev, and Canberra distances. In numerical simulation, these six distances yield an objective function that settles to a fairly constant value. The Chebyshev distance, in contrast, yields an objective function value that fluctuates, so it is not in accordance with the optimization problem of the Fuzzy C-Means clustering method.
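The procedure summarized in the abstract can be illustrated with a minimal sketch of Fuzzy C-Means in which the distance function is a swappable argument. This is not the authors' implementation. The names `fuzzy_c_means`, `euclidean`, `chebyshev`, and `canberra`, the fuzzifier `m = 2`, the random initialization, and the iteration count are all assumptions made for illustration. The centroid update used here is the weighted mean that the Lagrange multiplier derivation gives for the squared Euclidean distance; reusing it unchanged with other distances is a simplifying assumption, and the paper may treat those cases differently.

```python
import numpy as np

def euclidean(X, V):
    # Pairwise Euclidean distance between rows of X (n, d) and rows of V (c, d).
    return np.sqrt(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2))

def chebyshev(X, V):
    # Pairwise Chebyshev distance: largest coordinate-wise absolute difference.
    return np.abs(X[:, None, :] - V[None, :, :]).max(axis=2)

def canberra(X, V):
    # Pairwise Canberra distance; terms with a zero denominator count as 0.
    num = np.abs(X[:, None, :] - V[None, :, :])
    den = np.abs(X[:, None, :]) + np.abs(V[None, :, :])
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return ratio.sum(axis=2)

def fuzzy_c_means(X, c, dist, m=2.0, n_iter=50, seed=0):
    """Illustrative Fuzzy C-Means loop (assumed settings, not the paper's code).

    Returns the membership matrix U (n, c), the centroids V (c, d), and the
    objective function value recorded at every iteration.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    history = []
    for _ in range(n_iter):
        W = U ** m
        # Weighted-mean centroid update (exact only for squared Euclidean distance).
        V = (W.T @ X) / W.sum(axis=0)[:, None]
        # Clamp distances away from zero to avoid division by zero below.
        D = np.maximum(dist(X, V), 1e-12)
        # Objective: sum of membership^m times squared distance.
        history.append(float((W * D ** 2).sum()))
        # Membership update from the Lagrange multiplier conditions.
        inv = D ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, V, history
```

Calling `fuzzy_c_means(X, 3, canberra)` and `fuzzy_c_means(X, 3, chebyshev)` on the same data and comparing the two `history` lists is one way to inspect whether the objective function settles or fluctuates for a given distance, which is the behavior the abstract uses as its suitability criterion.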

Full Text