Abstract
This article presents an empirical user study that compares eight multidimensional projection techniques for supporting the estimation of the number of clusters, k, embedded in six multidimensional data sets. The selection of the techniques was based on their intended design, or use, for visually encoding data structures, that is, neighborhood relations between data points or groups of data points in a data set. Concretely, we study: the difference between the estimates of k as given by participants when using different multidimensional projections; the accuracy of user estimations with respect to the number of labels in the data sets; the perceived usability of each multidimensional projection; whether user estimates disagree with k values given by a set of cluster quality measures; and whether there is a difference between experienced and novice users in terms of estimates and perceived usability. The results show that: dendrograms (from Ward’s hierarchical clustering) are likely to lead to estimates of k that are different from those given with other multidimensional projections, while Star Coordinates and Radial Visualizations are likely to lead to similar estimates; t-Stochastic Neighbor Embedding is likely to lead to estimates which are closer to the number of labels in a data set; cluster quality measures are likely to produce estimates which are different from those given by users using Ward and t-Stochastic Neighbor Embedding; U-Matrices and reachability plots will likely have a low perceived usability; and there is no statistically significant difference between the answers of experienced and novice users. Moreover, as data dimensionality increases, cluster quality measures are likely to produce estimates which are different from those perceived by users using any of the assessed multidimensional projections.
It is also apparent that the inherent complexity of a data set, as well as the capability of each visual technique to disclose such complexity, has an influence on the perceived usability.
Highlights
Visualizing the structure of a data set can be seen as an initial step toward gaining an understanding of the problem space represented by the data itself
We investigate the effects the aforementioned multidimensional projections (MDPs) have on user-driven estimations of k, their perceived usability for the task of estimating k, and whether they lead to an implicit agreement with the estimates given by NbClust
The results presented show that the local methods analyzed, Laplacian Eigenmaps (LE) and Locally Linear Embedding (LLE), are more likely to be influenced by small variations in both data and parameters, and they tend to provide cluttered visualizations, whereas data points in t-Stochastic Neighbor Embedding (t-SNE), Isomap, and Principal Component Analysis (PCA) are more scattered. t-SNE, due to the nature of its gradient, tends to form small clusters
Summary
Visualizing the structure of a data set can be seen as an initial step toward gaining an understanding of the problem space represented by the data itself. Scatter plots and scatter plot matrices are common examples of visual encodings for data sets with dimensionalities between two and twelve.[3] For higher dimensional data sets, MDPs may rely on two types of unsupervised machine learning (ML) techniques: dimensionality reduction (DR) and clustering. Both take a multidimensional data set as input and may produce an output that can later be plotted using a visual encoder (VE), for example, a scatter plot or a dendrogram. We argue that this is the case for t-SNE: its behavior as a general DR technique is uncertain,[17] which is why its authors presented it as a visualization technique.
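To make the clustering route concrete, the following is a minimal sketch of how a Ward hierarchy (the structure a dendrogram draws) could be turned into an estimate of k. It is an illustration only, not the procedure used in the study: the function names, the toy data, and the "largest jump in merge cost" cut rule are all assumptions introduced here, and the rule merely imitates what a reader might do by eye when looking for the longest vertical gap in a dendrogram.

```python
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_merge_costs(points):
    """Greedy agglomeration; returns the cost of each successive merge."""
    clusters = [[tuple(p)] for p in points]
    costs = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose merge is cheapest (i < j always).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        costs.append(ward_cost(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return costs


def estimate_k(points):
    """Cut the hierarchy at the largest jump between consecutive merge costs.

    Hypothetical heuristic; needs at least three points.
    """
    costs = ward_merge_costs(points)
    gaps = [later - earlier for earlier, later in zip(costs, costs[1:])]
    merges_kept = gaps.index(max(gaps)) + 1  # merges performed before the cut
    return len(points) - merges_kept


# Toy 4-dimensional data (made up for illustration): three compact groups.
points = [
    (0.0, 0.0, 0.0, 0.0), (0.1, 0.0, 0.1, 0.0), (0.0, 0.2, 0.0, 0.1),
    (5.0, 5.0, 5.0, 5.0), (5.1, 5.0, 4.9, 5.0), (5.0, 5.2, 5.0, 5.1),
    (10.0, 0.0, 10.0, 0.0), (10.1, 0.1, 10.0, 0.0), (9.9, 0.0, 10.2, 0.1),
]

print(estimate_k(points))
```

The naive pairwise search keeps the sketch short but scales poorly; for real data sets a library implementation of Ward linkage would be the more sensible choice. A fixed cut rule like this one also removes the human judgment that the study sets out to measure, so it should be read as a stand-in for a single, idealized reading of the dendrogram.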