Cluster analysis for two-level data sets: Classifying tree species from individual measurements

Nicolas Picard, Avner Bar-Hen

doi:10.1016/j.ecoinf.2013.02.001

Abstract

Two-level data sets consist of higher level (say population) traits computed from lower level (say individual) observations. Cluster analysis for two-level data sets aims at classifying populations using individual observations. Most existing techniques for classifying populations in two-level data sets actually operate on population traits (e.g. the k-means algorithm), thus disregarding the within-population individual variability. In this study, the k-means algorithm was compared with a recently developed classification method that accounts for within-population variability. Populations were tree species in a tropical rain forest in French Guiana, and individual observations were tree diameters and diameter growth rates. Tree species were classified according to either their diameter and growth rate, or their asymptotic diameter distribution as predicted by an Usher matrix population model. In both cases, the k-means algorithm and the two-level classification method defined species clusters that were significantly related according to the Rand index. Nevertheless, the clusters produced by the two methods differed increasingly as the within-population individual variability increased. Whereas the k-means algorithm produced equally sized spherical clusters, the two-level classification method adapted the size and shape of clusters to the individual within-population variability. Accounting for individual variability when classifying populations in ecology may thus be important, although it is rarely done.
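
As an illustration of how two species classifications can be compared, the following minimal sketch computes the plain (unadjusted) Rand index, i.e. the fraction of species pairs on which two partitions agree (both place the pair in the same cluster, or both place it in different clusters). The function name `rand_index` and the two label lists are hypothetical and are not taken from the study; the significance assessment mentioned in the abstract is not reproduced here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of item pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two items are required")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]  # pair together in partition A?
        same_b = labels_b[i] == labels_b[j]  # pair together in partition B?
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical cluster labels for six species (illustrative only).
kmeans_labels = [0, 0, 1, 1, 2, 2]
two_level_labels = [0, 0, 1, 2, 2, 2]

ri = rand_index(kmeans_labels, two_level_labels)
print(f"Rand index: {ri:.2f}")
```

The index depends only on which species are grouped together, not on the cluster labels themselves, so relabelling the clusters of either partition leaves it unchanged. A value of 1 means the two partitions are identical.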

Full Text