Evaluating a class of distance-mapping algorithms for data mining and clustering

Jason Tsong-Li Wang, Bruce A. Shapiro, King-Ip Lin, Dennis Shasha, Xiong Wang, Kaizhong Zhang

doi:10.1145/312129.312264

Abstract

A distance-mapping algorithm takes a set of objects and a distance metric and maps those objects to a Euclidean or pseudo-Euclidean space in such a way that the distances among objects are approximately preserved. Distance-mapping algorithms are a useful tool for clustering and visualization in data-intensive applications, because they replace expensive distance calculations with sum-of-squares calculations. This can make clustering practical in large databases with expensive distance metrics. In this paper we present five distance-mapping algorithms and conduct experiments to compare their performance in data clustering applications. These include two algorithms, FastMap and MetricMap, and three hybrid heuristics that combine the two in different ways. Experimental results on both synthetic and RNA data show the superiority of the hybrid algorithms. The results imply that FastMap and MetricMap capture complementary information about distance metrics and can therefore be used together to great benefit. The net effect is that multi-day computations may be done in minutes.
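
To make the idea of distance mapping concrete, the following is a minimal sketch of a FastMap-style projection as it is commonly described in the literature. It is not the authors' implementation and does not cover MetricMap or the hybrid heuristics. The names `fastmap`, `dist`, and `k` are illustrative assumptions, and the pivot selection shown is one simple heuristic among several possible ones.

```python
import math


def fastmap(objects, dist, k):
    """Map objects to k-dimensional coordinates, FastMap-style (illustrative sketch).

    `dist` is any callable returning a distance between two objects;
    it is assumed to be symmetric with dist(x, x) == 0.
    """
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, dim):
        # Squared original distance minus the part already explained
        # by the first `dim` coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return d2

    for dim in range(k):
        # Pivot heuristic: start at object 0, then jump twice to the
        # farthest object under the residual distance.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, dim))
        a = max(range(n), key=lambda j: residual_sq(b, j, dim))
        dab2 = residual_sq(a, b, dim)
        if dab2 <= 1e-12:
            break  # nothing left to explain; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Projection onto the line through the two pivots (cosine law).
            coords[i][dim] = (
                residual_sq(a, i, dim) + dab2 - residual_sq(b, i, dim)
            ) / (2 * dab)
    return coords


# Usage: any metric can be plugged in; plain 2-D points keep the example small.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
embedded = fastmap(points, math.dist, 2)
# Once mapped, distances are cheap sum-of-squares computations.
cheap_distance = math.dist(embedded[0], embedded[3])
```

In a setting like the one the abstract describes, `dist` would be the expensive metric, called only while building the embedding; subsequent clustering would then work on the returned coordinates.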
