Abstract

Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing‐based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance‐based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.

Highlights

  • In data-rich scientific studies, it is often necessary to apply a clustering algorithm to detect groups of homogeneous objects with respect to a set of descriptors

  • The only modification we made to the original Clarke et al. (2008) algorithm was to use dissimilarities to compute the resemblance profile. This convention is consistent with the Fathom Toolbox for MATLAB (Jones, 2015), which was used for our testing and evaluations, and it is advantageous because dissimilarity measures span a broad range of types that can be applied across many research disciplines

  • It is important to determine where unweighted pair group method with arithmetic mean (UPGMA) clustering, with dissimilarity profiles (DISPROF) implemented as a decision criterion, is affected by changes in data configuration, distribution, dispersion, and correlation
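
The permutation test behind a dissimilarity profile can be sketched in Python as follows. This is a simplified illustration of the general procedure described by Clarke et al. (2008), adapted to dissimilarities. It is not the Fathom Toolbox implementation used in the study. The function names (`dissimilarity_profile`, `permute_columns`, `disprof_sketch`), the Euclidean default metric, and the choice of 199 permutations are all assumptions made for this example.

```python
import numpy as np
from scipy.spatial.distance import pdist


def dissimilarity_profile(data, metric="euclidean"):
    """Return all pairwise dissimilarities among objects (rows), sorted ascending."""
    return np.sort(pdist(data, metric=metric))


def permute_columns(data, rng):
    """Shuffle each descriptor (column) independently across objects (rows)."""
    return np.column_stack([rng.permutation(col) for col in data.T])


def disprof_sketch(data, n_perm=199, metric="euclidean", seed=0):
    """Hypothetical, simplified profile test; returns (pi statistic, p-value).

    Assumed null hypothesis: the objects have no multivariate structure.
    """
    rng = np.random.default_rng(seed)
    observed = dissimilarity_profile(data, metric)

    # A first set of permutations estimates the mean profile expected
    # when descriptors vary independently of one another.
    null_profiles = np.array([
        dissimilarity_profile(permute_columns(data, rng), metric)
        for _ in range(n_perm)
    ])
    mean_profile = null_profiles.mean(axis=0)

    # Test statistic: summed absolute deviation from the mean null profile.
    pi_obs = np.abs(observed - mean_profile).sum()

    # A second, independent set of permutations gives the null
    # distribution of the statistic.
    pi_null = np.array([
        np.abs(
            dissimilarity_profile(permute_columns(data, rng), metric)
            - mean_profile
        ).sum()
        for _ in range(n_perm)
    ])

    # Permutation p-value, counting the observed statistic as one draw.
    p_value = (np.sum(pi_null >= pi_obs) + 1) / (n_perm + 1)
    return pi_obs, p_value
```

In a clustering workflow of the kind evaluated here, a test like this would be applied at each node of a UPGMA dendrogram, and a node would be split further only when the null hypothesis of no structure is rejected. The metric is left as a parameter because ecological applications often use measures other than Euclidean distance.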


Summary

Introduction

In data-rich scientific studies, it is often necessary to apply a clustering algorithm to detect groups of homogeneous objects with respect to a set of descriptors (i.e., measured variables). Detection of groups is useful in ecology, economics, genetics, and other disciplines that analyze large, multidimensional datasets. Clustering techniques for multivariate datasets are diverse and can be drawn from methods derived from.
