Abstract

The growing neural gas (GNG) is an unsupervised topology learning algorithm that models a data space through interconnected units placed on the most populated areas of that space. Its output is a graph that can be visually represented on a two-dimensional plane, disclosing cluster patterns in datasets. It is common, however, for GNG to produce highly connected graphs when trained on high-dimensional data, which in turn leads to highly cluttered 2D representations that may fail to disclose meaningful patterns. Moreover, its sequential learning limits its potential for faster executions on local datasets and, more importantly, its potential for training on distributed datasets while leveraging the computational resources of the infrastructures in which they reside. This paper presents two methods that improve GNG for the visualization of cluster patterns in large-scale and high-dimensional datasets. The first provides more accurate and meaningful 2D visual representations of cluster patterns in high-dimensional datasets by avoiding connections that lead to high-dimensional graph structures in the modeled topology, which may, in turn, result in overplotting and clutter. The second enables the use of GNG on big and distributed datasets with faster execution times by modeling and merging separate parts of a dataset using the MapReduce model. Quantitative and qualitative evaluations show that the first method leads to lower-dimensional graph structures that provide more meaningful (and sometimes more accurate) cluster representations with less overplotting and clutter, and that the second method preserves the accuracy and meaning of the cluster representations while enabling execution in large-scale and distributed settings.
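The second method is only described at a high level above, so the following is a minimal, illustrative sketch in Python of how GNG training could be decomposed into map and reduce phases over data partitions. The functions train_gng and merge_graphs are hypothetical placeholders and do not reproduce the paper's actual modeling or merging procedures.

```python
# Illustrative only: MapReduce-style decomposition of GNG training.
# train_gng and merge_graphs are hypothetical stand-ins, not the paper's method.
from functools import reduce
import numpy as np

def train_gng(partition):
    """Map step: model one data partition with a (placeholder) GNG graph."""
    # A real implementation would run GNG over the partition; here we keep a
    # few points as stand-in "units" just to illustrate the data flow.
    return {"units": [p.tolist() for p in partition[:3]], "edges": []}

def merge_graphs(g1, g2):
    """Reduce step: combine two partial graphs into one (naive union)."""
    return {"units": g1["units"] + g2["units"],
            "edges": g1["edges"] + g2["edges"]}

rng = np.random.default_rng(0)
partitions = [rng.random((100, 5)) for _ in range(4)]  # four distributed chunks
partial_graphs = map(train_gng, partitions)            # map phase
global_graph = reduce(merge_graphs, partial_graphs)    # reduce phase
print(len(global_graph["units"]))                      # 12 stand-in units
```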

Highlights

  • Preventing high-dimensional graph structures in GNG leads to more accurate and meaningful cluster representations.

  • Using the MapReduce model allows training GNG over distributed infrastructures while preserving or improving accuracy.

  • Sampling further reduces GNG training times while preserving or improving accuracy.



Introduction

A common problem in exploratory data analysis, a process that relies on limited preconceived assumptions [1], is the investigation of cluster patterns in datasets [2], i.e., uncovering groups of instances that form neighborhoods according to a given similarity or distance metric. As datasets grow in size (number of data points) and dimensionality (number of features), it becomes more challenging to uncover such cluster patterns in a human-interpretable and usable way. High-dimensional datasets are more difficult to model because data points are sparser in the multidimensional space (a.k.a. the curse of dimensionality [9]), which makes modeled distances less meaningful and affects the representativeness of the subsequent visual representations. Bigger datasets (e.g., with millions of data points) entail further usability challenges, such as avoiding overplotting and clutter in the visual representations, or providing system feedback from both learning processes and user-triggered interactions under a given time threshold (i.e., 10 seconds for task-preserving latency) [10,11,12]. The growing neural gas (GNG) models such datasets incrementally: at each step it takes an input signal (data point) δ from the dataset and adapts its graph of interconnected units accordingly, as sketched below.
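As a point of reference for the methods discussed in this paper, below is a minimal sketch of one adaptation step of the standard GNG algorithm. The class and parameter names (GNGSketch, eps_b, eps_n, age_max) are illustrative rather than taken from the paper, and periodic unit insertion and error decay are omitted for brevity.

```python
import numpy as np

class GNGSketch:
    """Minimal sketch of the core adaptation step of standard GNG.
    Periodic unit insertion and error decay are omitted for brevity."""

    def __init__(self, dim, eps_b=0.05, eps_n=0.006, age_max=50, seed=0):
        rng = np.random.default_rng(seed)
        self.units = [rng.random(dim), rng.random(dim)]  # reference vectors
        self.errors = [0.0, 0.0]                         # accumulated errors
        self.edges = {}                                  # (i, j), i < j -> age
        self.eps_b, self.eps_n, self.age_max = eps_b, eps_n, age_max

    def adapt(self, delta):
        """Process a single input signal (data point) delta."""
        # 1. Find the nearest unit s1 and the second-nearest unit s2.
        dists = [float(np.linalg.norm(delta - u)) for u in self.units]
        s1, s2 = map(int, np.argsort(dists)[:2])

        # 2. Accumulate the squared distance as the error of the winner.
        self.errors[s1] += dists[s1] ** 2

        # 3. Move s1 and its topological neighbours towards delta,
        #    aging the edges emanating from s1 along the way.
        self.units[s1] = self.units[s1] + self.eps_b * (delta - self.units[s1])
        for (i, j) in list(self.edges):
            if s1 in (i, j):
                n = j if i == s1 else i
                self.units[n] = self.units[n] + self.eps_n * (delta - self.units[n])
                self.edges[(i, j)] += 1

        # 4. Connect s1 and s2 (or reset the age of their existing edge).
        #    This is the kind of connection the paper's first method restricts
        #    in order to avoid high-dimensional graph structures.
        self.edges[(min(s1, s2), max(s1, s2))] = 0

        # 5. Remove edges that have grown too old; isolated units could be
        #    pruned here as well.
        self.edges = {e: a for e, a in self.edges.items() if a <= self.age_max}

# Usage sketch: one pass over a toy dataset.
data = np.random.default_rng(1).random((500, 3))
gng = GNGSketch(dim=3)
for delta in data:
    gng.adapt(delta)
```

The connection created in step 4 is where the sequential, signal-by-signal nature of GNG becomes visible: the graph grows one adaptation at a time, which is also what limits naive parallelization and motivates the MapReduce decomposition sketched earlier.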

