Abstract

The growing neural gas (GNG) is an unsupervised topology learning algorithm that models a data space through interconnected units placed on the most populated areas of that space. Its output is a graph that can be visually represented on a two-dimensional plane, disclosing cluster patterns in datasets. It is common, however, for GNG to produce highly connected graphs when trained on high-dimensional data, which in turn leads to highly cluttered 2D representations that may fail to disclose meaningful patterns. Moreover, its sequential learning limits its potential for faster executions on local datasets and, more importantly, its potential for training on distributed datasets while leveraging the computational resources of the infrastructures in which they reside. This paper presents two methods that improve GNG for the visualization of cluster patterns in large-scale and high-dimensional datasets. The first provides more accurate and meaningful 2D visual representations of cluster patterns in high-dimensional datasets by avoiding connections that lead to high-dimensional graph structures in the modeled topology, which may, in turn, result in overplotting and clutter. The second enables the use of GNG on big and distributed datasets with faster execution times by modeling and merging separate parts of a dataset using the MapReduce model. Quantitative and qualitative evaluations show that the first method leads to lower-dimensional graph structures that provide more meaningful (and sometimes more accurate) cluster representations with less overplotting and clutter, and that the second method preserves the accuracy and meaning of the cluster representations while enabling execution in large-scale and distributed settings.
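The second method is only described at a high level above, so the following is a minimal, illustrative sketch in Python of how GNG training could be decomposed into map and reduce phases over data partitions. The functions train_gng and merge_graphs are hypothetical placeholders and do not reproduce the paper's actual modeling or merging procedures.

```python
# Illustrative only: MapReduce-style decomposition of GNG training.
# train_gng and merge_graphs are hypothetical stand-ins, not the paper's method.
from functools import reduce
import numpy as np

def train_gng(partition):
    """Map step: model one data partition with a (placeholder) GNG graph."""
    # A real implementation would run GNG over the partition; here we keep a
    # few points as stand-in "units" just to illustrate the data flow.
    return {"units": [p.tolist() for p in partition[:3]], "edges": []}

def merge_graphs(g1, g2):
    """Reduce step: combine two partial graphs into one (naive union)."""
    return {"units": g1["units"] + g2["units"],
            "edges": g1["edges"] + g2["edges"]}

rng = np.random.default_rng(0)
partitions = [rng.random((100, 5)) for _ in range(4)]  # four distributed chunks
partial_graphs = map(train_gng, partitions)            # map phase
global_graph = reduce(merge_graphs, partial_graphs)    # reduce phase
print(len(global_graph["units"]))                      # 12 stand-in units
```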

Highlights

  • Preventing high-dimensional graph structures in GNG leads to more accurate and meaningful cluster representations.

  • Using the MapReduce model allows training GNG over distributed infrastructures while preserving or improving accuracy.

  • Sampling further reduces GNG training times while preserving or improving accuracy.



Introduction

A common problem in exploratory data analysis, a process that relies on limited preconceived assumptions [1], is the investigation of cluster patterns in datasets [2], i.e., uncovering groups of instances that form neighborhoods according to a given similarity or distance metric. As datasets grow in size (number of data points) and dimensionality (number of features), it becomes more challenging to uncover such cluster patterns in a human-interpretable and usable way. High-dimensional datasets are more difficult to model because data points are sparser in the multidimensional space (a.k.a. the curse of dimensionality [9]), which makes modeled distances less meaningful and affects the representativeness of the subsequent visual representations. Bigger datasets (e.g., with millions of data points) entail further usability challenges, such as avoiding overplotting and clutter in the visual representations, or providing system feedback from both learning processes and user-triggered interactions under a given time threshold (i.e., 10 seconds for task-preserving latency) [10,11,12]. The growing neural gas (GNG) models such datasets incrementally: at each step it takes an input signal (data point) δ from the dataset and adapts its graph of interconnected units accordingly, as sketched below.
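As a point of reference for the methods discussed in this paper, below is a minimal sketch of one adaptation step of the standard GNG algorithm. The class and parameter names (GNGSketch, eps_b, eps_n, age_max) are illustrative rather than taken from the paper, and periodic unit insertion and error decay are omitted for brevity.

```python
import numpy as np

class GNGSketch:
    """Minimal sketch of the core adaptation step of standard GNG.
    Periodic unit insertion and error decay are omitted for brevity."""

    def __init__(self, dim, eps_b=0.05, eps_n=0.006, age_max=50, seed=0):
        rng = np.random.default_rng(seed)
        self.units = [rng.random(dim), rng.random(dim)]  # reference vectors
        self.errors = [0.0, 0.0]                         # accumulated errors
        self.edges = {}                                  # (i, j), i < j -> age
        self.eps_b, self.eps_n, self.age_max = eps_b, eps_n, age_max

    def adapt(self, delta):
        """Process a single input signal (data point) delta."""
        # 1. Find the nearest unit s1 and the second-nearest unit s2.
        dists = [float(np.linalg.norm(delta - u)) for u in self.units]
        s1, s2 = map(int, np.argsort(dists)[:2])

        # 2. Accumulate the squared distance as the error of the winner.
        self.errors[s1] += dists[s1] ** 2

        # 3. Move s1 and its topological neighbours towards delta,
        #    aging the edges emanating from s1 along the way.
        self.units[s1] = self.units[s1] + self.eps_b * (delta - self.units[s1])
        for (i, j) in list(self.edges):
            if s1 in (i, j):
                n = j if i == s1 else i
                self.units[n] = self.units[n] + self.eps_n * (delta - self.units[n])
                self.edges[(i, j)] += 1

        # 4. Connect s1 and s2 (or reset the age of their existing edge).
        #    This is the kind of connection the paper's first method restricts
        #    in order to avoid high-dimensional graph structures.
        self.edges[(min(s1, s2), max(s1, s2))] = 0

        # 5. Remove edges that have grown too old; isolated units could be
        #    pruned here as well.
        self.edges = {e: a for e, a in self.edges.items() if a <= self.age_max}

# Usage sketch: one pass over a toy dataset.
data = np.random.default_rng(1).random((500, 3))
gng = GNGSketch(dim=3)
for delta in data:
    gng.adapt(delta)
```

The connection created in step 4 is where the sequential, signal-by-signal nature of GNG becomes visible: the graph grows one adaptation at a time, which is also what limits naive parallelization and motivates the MapReduce decomposition sketched earlier.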

