A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data.

Ali Seyed Shirkhorshidi, Saeed Aghabozorgi, Teh Ying Wah, Andrew R Dalby

doi:10.1371/journal.pone.0144059

Open Access

https://doi.org/10.1371/journal.pone.0144059

Journal: PLoS ONE
Publication Date: Dec 11, 2015
Citations: 273
License type: CC BY 4.0

Affiliation: University of Malaya, IBM (Canada)

Abstract

Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two- or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, so that future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified into low- and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of newly proposed similarity or distance measures with traditional ones.

Highlights

One of the biggest challenges of this decade is handling databases that contain a variety of data types
Various studies compare similarity/distance measures for clustering numerical data, but this study differs from existing studies and related work in two ways: first, its aim is to investigate the similarity/distance measures against low-dimensional and high-dimensional datasets and to analyse their behaviour in this context
The discussion of Rand index and iteration count shows that the Average measure is accurate on most datasets with both the k-means and k-medoids algorithms, and that it is also the second-fastest similarity measure after Pearson in terms of convergence, making it a safe choice when clustering with k-means or k-medoids
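
The Rand index mentioned in the highlights scores the agreement between two labelings of the same points. The following is a minimal sketch in plain Python of the standard pair-counting definition; the function name `rand_index` and its arguments are illustrative and are not taken from the paper.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Return the fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two points
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


# Cluster names do not matter, only the grouping does.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A value of 1.0 means the two labelings group the points identically; lower values mean more pairs are treated differently by the two labelings.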

Summary

Introduction

One of the biggest challenges of this decade is handling databases that contain a variety of data types. The distance measure is a main component of distance-based clustering algorithms. To reveal the influence of various distance measures on data mining, researchers have carried out experimental studies in various fields and have compared and evaluated the results generated by different distance measures. Deshpande et al. focused on data from a single knowledge area, for example biological data, and conducted a comparison of profile similarity measures for genetic interaction networks. They concluded that the Dot Product is consistently among the best measures across different conditions and genetic interaction datasets [22].
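
To illustrate why the distance measure matters in distance-based clustering, the sketch below assigns one point to the nearest of two cluster centers under different measures. It is an illustration only, not the paper's framework: all function names are hypothetical, the Pearson distance is taken as one minus the correlation coefficient, and the form used for the Average distance (Euclidean distance scaled by the number of dimensions) is an assumption, since this summary does not state the formulas.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_distance(x, y):
    """Assumed form: root of the mean squared coordinate difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def pearson_distance(x, y):
    """Assumed form: 1 minus the Pearson correlation of the two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("Pearson correlation is undefined for constant vectors")
    return 1.0 - cov / (norm_x * norm_y)


def nearest_center(point, centers, distance):
    """Index of the center closest to `point` under the given distance."""
    return min(range(len(centers)), key=lambda k: distance(point, centers[k]))


point = [1.0, 2.0, 3.0]
centers = [[10.0, 20.0, 30.0], [3.0, 2.0, 1.0]]

for name, measure in [("euclidean", euclidean),
                      ("average", average_distance),
                      ("pearson", pearson_distance)]:
    print(name, nearest_center(point, centers, measure))
```

In this toy example the Euclidean and Average distances pick the second center, which is close in magnitude, while the Pearson distance picks the first, whose profile has the same shape as the point. The same assignment step sits inside k-means and k-medoids, which is why the choice of measure can change the resulting clusters.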

Objectives

Methods

Results

Conclusion