Abstract

A simple and fast k-medoids algorithm that updates medoids by minimizing the total within-cluster distance has been developed. Although it is simple and fast, as its name suggests, it neglects the local optima and empty clusters that may arise. With the distance as an input to the algorithm, a generalized distance function is developed to increase the variation of the distances, especially for mixed variable datasets. The variation of the distances is a crucial part of a partitioning algorithm because different distances produce different outcomes. In the experiments, the simple k-medoids algorithm performs consistently well in various settings of mixed variable data and achieves higher cluster accuracy than other distance-based partitioning algorithms for mixed variable data.
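The medoid-updating idea described above, choosing as medoid the cluster member with the smallest total distance to the other members, can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function names and the use of a precomputed distance matrix are assumptions.

```python
import numpy as np

def update_medoids(dist, labels, k):
    """For each cluster, pick as the new medoid the member whose total
    distance to all other members of that cluster is smallest.
    dist is a precomputed n x n distance matrix (an assumption here)."""
    medoids = np.empty(k, dtype=int)
    for c in range(k):
        members = np.flatnonzero(labels == c)       # indices of objects in cluster c
        within = dist[np.ix_(members, members)]     # pairwise distances inside the cluster
        medoids[c] = members[within.sum(axis=1).argmin()]
    return medoids

def assign(dist, medoids):
    """Assign every object to its nearest medoid."""
    return dist[:, medoids].argmin(axis=1)
```

Because the update only needs a distance matrix, any distance, including one defined for mixed variables, can be plugged in without changing the algorithm.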

Highlights

  • Cluster analysis is a vital exploratory tool in data structure investigation

  • The most common practice for a mixed variable dataset is applying the partitioning around medoids (PAM) [2], which replaces the centroids with the medoids

  • By taking local optima and empty clusters into consideration, we propose a k-medoids algorithm that improves the performance of the simple and fast k-medoids (SFKM) algorithm

Summary

Introduction

Cluster analysis is a vital exploratory tool in data structure investigation. Each object within a group is similar (homogeneous) to the others, and objects in different groups are distinct (heterogeneous) from one another [1,2]. The k-means algorithm is unsuitable when the data are mixed variable data because "means" as the centers of the clusters (centroids) are unavailable and the Euclidean distance is not applicable. The most common practice for a mixed variable dataset is to apply partitioning around medoids (PAM) [2], which replaces the centroids with medoids. After defining a distance for the mixed variable data, either the k-prototype or the PAM algorithm is applied. As a medoid-based algorithm, PAM is more robust with respect to the definition of the cluster center than centroid-based algorithms. In the medoid updating step, on the other hand, they are very similar: like centroid updating in k-means, both algorithms operate within clusters only. We generalize a distance function that is feasible for any combination of numerical and categorical distances and their respective weights.
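A generalized distance for mixed variables typically combines a numerical and a categorical component under user-chosen weights. The paper's exact formulation is not given here, so the sketch below uses one common instance, a Gower-style combination of a range-normalized Manhattan distance and simple matching, purely for illustration; the function name, argument layout, and weights are assumptions.

```python
import numpy as np

def mixed_distance(x_num, y_num, x_cat, y_cat, ranges, w_num=1.0, w_cat=1.0):
    """Illustrative weighted mixed distance (Gower-style assumption):
    range-normalised Manhattan distance on the numerical variables plus
    simple-matching distance on the categorical variables, each scaled
    by its weight and averaged over all variables."""
    d_num = np.abs(np.asarray(x_num) - np.asarray(y_num)) / np.asarray(ranges)
    d_cat = np.asarray(x_cat) != np.asarray(y_cat)   # mismatch indicator per variable
    p = d_num.size + d_cat.size                      # total number of variables
    return (w_num * d_num.sum() + w_cat * d_cat.sum()) / p
```

Swapping in other numerical or categorical distances, or other weights, changes only this function; the partitioning algorithm itself is untouched.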

K-Medoids Algorithms
Proposed K-Medoids Algorithm
Proposed Distance Method
Demonstration on Artificial and Real Datasets
Different Variable Proportions
Different Number of Clusters
Different Numbers of Variables
Different Numbers of Objects
Iris Data
Wine Data
Vote Data
Zoo Data
Credit Approval Data
Findings
Conclusions
