Hyperclust: A Method for Finding Significant Hierarchical and Clinal Structure

J L Wong, R I C Hansell

doi:10.1093/sysbio/32.3.239

Abstract

Hyperclust is a new clustering method with which we search for hierarchical structure in binary data. These hierarchical trees have the following properties. Any two objects (branches) that are joined at a node share common characters in a manner consistent with a random allocation model. This model uses a character pool which is explicitly defined for every node. The test statistic (number of characters shared) follows a hypergeometric probability distribution. Furthermore, alternative random allocation models can be tested by using different character pools. Finally, by constructing alternative trees incorporating overlapping subsets of objects, we can test whether the local structure within those subsets can be adequately described by a hierarchical model.

[Statistical clustering; hierarchical clustering; classification.]

We present a technique for exploring the structure in binary data sets. This method attempts to organize objects into hierarchical trees such that the structure of the trees is consistent with a random allocation model. Unlike most hierarchical clustering methods (Sneath and Sokal, 1973), this method produces trees which have been tested against alternative, more complex structures. Furthermore, it is unique in that it tests whether the structures implied by the data form a consistent hierarchy. The clustering method, which we call Hyperclust, uses the hypergeometric probability distribution to generate a null or neutral hierarchical model against which an alternative explanation of the data can be compared.

Hyperclust uses data with the following characteristics. The data consist of a set of objects, each object being defined by the set of discrete binary characters that it possesses. In the example we present, the objects are rivers and the characters involve the presence or absence of different species of freshwater fish.

Hyperclust builds trees which have the following characteristics. Any internal node in the tree will have two or more branches. The terminal branches of the tree represent the original (primitive) objects. These objects are joined at the highest nodes in the tree. However, any branch can itself be thought of as an object. The characters of a branch are defined by the (set theoretic) union of the character sets of the terminal objects on that branch. New objects are formed every time a new node is formed.

In order that a tree fits the Hyperclust model, it must meet the following two conditions. (1) Any two objects joined at a common node, whether they are terminal objects or branches, must share a number of characters that is consistent with a random allocation model. (2) Any two objects arising from a node must be similar (i.e., share too many characters in common to have been joined at any lower node). Building a node in a tree implies not only structure but statistically significant structure on a local scale. The tree obtained from Hyperclust can be thought of as a chain of tested hypotheses.

The similarity between two objects is measured relative to a random allocation model. The structure of the random allocation model controls the behavior of Hyperclust and, hence, the model will be described in greater detail.

THE RANDOM ALLOCATION MODEL

As with most concepts of similarity, the greater the number of characters shared, the greater the similarity. However, the similarity measured is not linearly related to the number of characters shared. Rather, similarity is measured as the probability of sharing the number of common characters when we assume a random allocation model. If we randomly select a set of N different charac-
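
The random allocation idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a character pool of known size, treats one object's characters as fixed, and treats the other object's characters as drawn at random without replacement from that pool, so that the number of shared characters follows a hypergeometric distribution. The function names, the choice of an upper-tail probability as the measure, and the toy river data are all assumptions made for illustration.

```python
from math import comb


def shared_count(a: set, b: set) -> int:
    """Test statistic: number of characters the two objects have in common."""
    return len(a & b)


def branch_characters(terminals: list) -> set:
    """Characters of a branch: the union of its terminal objects' character sets."""
    return set().union(*terminals)


def hypergeom_pmf(k: int, pool: int, n_a: int, n_b: int) -> float:
    """P(exactly k shared) when n_b characters are drawn without replacement
    from a pool of `pool` characters, n_a of which belong to the other object."""
    if k < max(0, n_a + n_b - pool) or k > min(n_a, n_b):
        return 0.0  # outside the support of the distribution
    return comb(n_a, k) * comb(pool - n_a, n_b - k) / comb(pool, n_b)


def upper_tail(k: int, pool: int, n_a: int, n_b: int) -> float:
    """P(at least k shared) under the same model (assumed choice of tail)."""
    return sum(hypergeom_pmf(i, pool, n_a, n_b)
               for i in range(k, min(n_a, n_b) + 1))


# Hypothetical toy data: two rivers and the fish species present in each.
POOL_SIZE = 10  # assumed size of the character pool for this node
river_a = {"pike", "perch", "trout", "bass"}
river_b = {"pike", "perch", "carp"}

k = shared_count(river_a, river_b)
p = upper_tail(k, POOL_SIZE, len(river_a), len(river_b))
print(f"shared = {k}, P(shared >= {k}) = {p:.4f}")
```

A small tail probability would indicate that the two objects share more characters than random allocation from that pool would readily produce. Changing `POOL_SIZE` corresponds to testing against a different character pool.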
