Abstract

Similarity and dissimilarity (distance) between objects are important aspects that must be considered when clustering data. When clustering categorical data, for instance, these measures need to properly address the particularities of categorical data. In this paper, we perform a comparative analysis of four dissimilarity measures used as distance metrics for clustering categorical data. The first is the Simple Matching Dissimilarity Measure (SMDM), one of the simplest and most widely used metrics for categorical attributes. The next two are context-based approaches, DIstance Learning in Categorical Attributes (DILCA) and Domain Value Dissimilarity (DVD), and the last is an extension of the SMDM, which is proposed in this paper. All four dissimilarities are applied as distance metrics in two well-known clustering algorithms, k-means and agglomerative hierarchical clustering. In this analysis, we also use internal and external cluster validity measures to compare the effectiveness of all four distance measures in both clustering algorithms.
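
As a rough illustration of the baseline measure, simple matching dissimilarity is commonly defined as the number of attributes on which two categorical objects disagree. The sketch below follows that common definition and is not taken from the paper. The function names and the toy records are hypothetical, and the paper's own formulation (for example, whether the count is normalized by the number of attributes) may differ.

```python
from typing import Hashable, List, Sequence


def simple_matching_dissimilarity(x: Sequence[Hashable], y: Sequence[Hashable]) -> int:
    """Count the attributes on which two categorical objects differ.

    Assumption: unnormalized mismatch count; dividing by len(x) would give
    a value in [0, 1] instead.
    """
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y))


def pairwise_dissimilarity(rows: Sequence[Sequence[Hashable]]) -> List[List[int]]:
    """Build a symmetric dissimilarity matrix with zeros on the diagonal."""
    n = len(rows)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = simple_matching_dissimilarity(rows[i], rows[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy records: (colour, size, shape)
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
matrix = pairwise_dissimilarity(records)
```

A precomputed matrix of this kind is one possible input for an agglomerative hierarchical clustering routine that accepts pairwise distances. A k-means style algorithm would instead call the dissimilarity function directly when assigning objects to cluster representatives.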

