In this work we discuss several methods used in the clustering of objects which can be represented as points in Euclidean space. Moreover, only those procedures are considered that lead to a partition of a sample into a preestablished number of groups, although some ideas are also valid for hierarchical schemes (Lance and Williams [1966]). There are two major approaches to this problem. In the first one (which could be called metric) a measure of distance is defined between each point x of the sample and each subset G thereof (an (i, k)-measure in the terminology of Lance and Williams [1968]). An initial partition is generated (for instance at random, or based on some external considerations) and then the points are successively reassigned to the nearest group, until no more reallocations are possible (or desired). In the second approach, each partition is assigned a numerical value, usually measuring the reduction of uncertainty due to grouping, and a search is made for partitions that optimize this functional (Friedman and Rubin [1967], Rubin [1967]). The simplest and most commonly used metric is the Euclidean distance between x and the mean of G (Ball and Hall [1967], Forgy [1965], MacQueen [1967]). In spite of its practical advantages, it has the drawback that the results it produces are not invariant under nonsingular linear transformations of the data (e.g., changes in the measurement units). One could avoid this drawback if one knew the within-groups covariance matrix W corresponding to the correct partition being sought. Then the most natural metric would be the Mahalanobis distance based on W. A possible solution to this dilemma is to use, at each step, the Mahalanobis metric induced by the matrix W corresponding to the current partition. A second conceivable improvement is obtained when one tries to overcome the fact that in the two preceding cases the metric is the same for all groups. If one suspects that the true clusters may have very different covariance structures, it would be reasonable to let each cluster have its own metric, based on its current covariance matrix. This was advocated by Chernoff [1970] and Rohlf [1970].
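The three metrics described above can be compared in a small sketch. This is only an illustration under stated assumptions, not the procedure of any of the cited authors: the function names (`within_groups_covariance`, `squared_distances`, `reassign`) and the metric labels `"euclidean"`, `"pooled"` and `"per_group"` are hypothetical, all points are reassigned at once rather than one at a time, W is scaled by n - k, and iteration simply stops if a group would become too small for its covariance matrix to be inverted.

```python
import numpy as np


def within_groups_covariance(X, labels, k):
    """Pooled within-groups covariance W of the current partition (scaled by n - k)."""
    n, d = X.shape
    W = np.zeros((d, d))
    for g in range(k):
        members = X[labels == g]
        centered = members - members.mean(axis=0)
        W += centered.T @ centered
    return W / (n - k)


def squared_distances(X, labels, k, metric):
    """Squared distance from every point to every group mean, shape (n, k)."""
    n, d = X.shape
    means = np.array([X[labels == g].mean(axis=0) for g in range(k)])
    if metric == "euclidean":
        # Same identity metric for all groups.
        inverses = [np.eye(d)] * k
    elif metric == "pooled":
        # Mahalanobis metric induced by W of the current partition.
        inverses = [np.linalg.inv(within_groups_covariance(X, labels, k))] * k
    elif metric == "per_group":
        # Each group gets its own metric from its current covariance matrix.
        inverses = [
            np.linalg.inv(np.atleast_2d(np.cov(X[labels == g], rowvar=False)))
            for g in range(k)
        ]
    else:
        raise ValueError(f"unknown metric: {metric}")
    D = np.empty((n, k))
    for g in range(k):
        diff = X - means[g]
        # Quadratic form diff_i^T M diff_i for every row i.
        D[:, g] = np.einsum("ij,jk,ik->i", diff, inverses[g], diff)
    return D


def reassign(X, labels, k, metric="euclidean", max_iter=100):
    """Move points to the nearest group until the partition stops changing."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        new_labels = squared_distances(X, labels, k, metric).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no more reallocations are possible
        # Assumption of this sketch: stop rather than let a group shrink to
        # d points or fewer, where its covariance matrix would be singular.
        if np.bincount(new_labels, minlength=k).min() <= X.shape[1]:
            break
        labels = new_labels
    return labels
```

A call such as `reassign(X, initial_labels, 2, metric="pooled")` takes an `(n, d)` array and an initial partition coded as integers `0..k-1`. With the `"pooled"` metric the distances, and hence the resulting partition, are unchanged when the data are replaced by `X @ A.T` for a nonsingular matrix `A`, which is the invariance property the Euclidean metric lacks.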