Abstract

This article provides a framework for assessing and quantifying the “clusteredness” of a data representation. Clusteredness is a global univariate property defined as a layout diverging from equidistance of points to the closest neighboring point set. The OPTICS algorithm encodes the global clusteredness as a pair of clusteredness-representative distances and an algorithmic ordering. We use this to construct an index for quantification of clusteredness, coined the OPTICS Cordillera, as the norm of subsequent differences over the pair. We provide lower and upper bounds and a normalization for the index. We show that the index simultaneously captures important aspects of clusteredness such as cluster compactness, cluster separation, and number of clusters. The index can be used as a goodness-of-clusteredness statistic, as a function over a grid, or to compare different representations. For illustration, we apply our suggestion to dimensionality-reduced 2D representations of Californian counties with respect to 48 climate-change-related variables. Online supplementary material is available (including an R package, the data, and additional mathematical details).

Highlights

  • Representation of a data matrix in R^m is an integral part of exploratory data analysis

  • We suggest a specific instance of the Cordillera that uses the OPTICS (Ordering Points To Identify The Clustering Structure; Ankerst et al., 1999) algorithm to obtain the algorithmic ordering and reachabilities; it fits neatly into the distance-density based framework and makes only weak assumptions about the object arrangement in the representation

  • We find the OPTICS Cordillera and the plot of the clusteredness-representative ordering-reachability pair for the representations in the left column


Summary

Introduction and Motivation

Representation of a data matrix in R^m is an integral part of exploratory data analysis. While visually interpreting the clusteredness of data representations is common, it appears to be highly subjective. We suggest a specific instance of the Cordillera that uses the OPTICS (Ordering Points To Identify The Clustering Structure; Ankerst et al., 1999) algorithm to obtain the algorithmic ordering and reachabilities; it fits neatly into the distance-density based framework and makes only weak assumptions about the object arrangement in the representation. We call this instance the OPTICS Cordillera, and we give results on its behavior.
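
As a rough illustration of the idea of taking "the norm of subsequent differences" over reachabilities in their algorithmic ordering, the following Python sketch may help. It is not the article's definition or its R implementation: the function name, the capping of undefined (infinite) reachabilities at a maximum distance d_max, and the choice of a q-norm are all assumptions made here for illustration, and no normalization is applied.

```python
import math


def successive_difference_norm(reachability, d_max, q=1):
    """q-norm of differences between consecutive reachability values.

    Hypothetical helper, not taken from the article.
    reachability: reachability distances listed in the algorithmic ordering.
    d_max: cap applied to every value (assumption: undefined or infinite
           reachabilities, such as that of the first point, become d_max).
    q: order of the norm (q=1 sums the absolute jumps).
    """
    capped = [min(r, d_max) for r in reachability]
    # Jumps between neighbours in the ordering: large jumps mark the
    # transition between a dense region and a gap.
    jumps = [abs(b - a) for a, b in zip(capped, capped[1:])]
    return sum(j ** q for j in jumps) ** (1.0 / q)


# Two compact, well-separated groups: small values inside each group,
# one large value where the ordering crosses the gap.
clustered = [math.inf, 0.1, 0.1, 0.1, 1.0, 0.1, 0.1]
# Roughly equidistant points: the profile is flat after the first point.
even = [math.inf, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

raw_clustered = successive_difference_norm(clustered, d_max=1.0)
raw_even = successive_difference_norm(even, d_max=1.0)
```

In this toy input the clustered profile yields the larger value, which matches the intuition that compact, separated clusters produce a more rugged reachability profile. In practice the reachabilities would come from an OPTICS implementation; for example, scikit-learn's `OPTICS` estimator exposes `reachability_` and `ordering_` attributes after fitting, so `model.reachability_[model.ordering_]` could serve as the input here.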

Clusteredness
Distance-Density Based Clusteredness
A Clusteredness Index
Algorithmic Ordering and Reachabilities by OPTICS
The OPTICS Cordillera
Application
Software
Conclusion and Discussion