Abstract

Somoclu is a massively parallel tool, written in C++, for training self-organizing maps on large data sets. It builds on OpenMP for multicore execution and on MPI for distributing the workload across the nodes in a cluster. It can also accelerate training with CUDA if graphics processing units are available. A sparse kernel is included, which is useful for high-dimensional but sparse data, such as the vector spaces common in text mining workflows. Python, R and MATLAB interfaces facilitate interactive use. Apart from fast execution, memory use is highly optimized, enabling the training of large emergent maps even on a single computer.

Highlights

  • Visual inspection of data is crucial to gain an intuition of the underlying structures

  • We tested an emergent map of 200 × 200 nodes, with the number of training instances ranging from 1,250 to 10,000

  • Emergent maps in the package kohonen are not possible, as the map is initialized with a sample from the data instances

Summary

Introduction

Visual inspection of data is crucial to gain an intuition of the underlying structures. Self-organizing maps (SOMs) are a widespread visualization tool that embeds high-dimensional data on a two-dimensional surface (typically a section of a plane or a torus) while preserving the local topological layout of the original data [9]. Tools exist that scale to large data sets by using cluster resources [18] and by combining GPU-accelerated nodes in clusters [27]. Popular languages used in data analytics all have SOM modules, including MATLAB [24], Python [6], and R [25]. Common to these tools is that they seldom make use of parallel computing capabilities, although the batch formulation of SOM training invites such implementations. Distributing the workload across multiple nodes is an extension of the parallel formulation (Section 3.2).
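To illustrate why the batch formulation lends itself to parallel execution, the following is a minimal sketch in plain Python, not Somoclu's actual implementation. It assumes a planar grid and a Gaussian neighborhood function, and all function names (`best_matching_unit`, `partial_sums`, `merge`, `batch_epoch`) are hypothetical. In one batch epoch, each node's new weight is a neighborhood-weighted average of the data, so the numerator and denominator sums can be accumulated independently over chunks of the data and then added together.

```python
import math
import random


def best_matching_unit(codebook, x):
    """Index of the codebook vector closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((w - v) ** 2 for w, v in zip(codebook[k], x)))


def partial_sums(codebook, coords, chunk, radius):
    """Accumulate numerator and denominator of the batch update for one chunk."""
    dim = len(codebook[0])
    num = [[0.0] * dim for _ in codebook]
    den = [0.0] * len(codebook)
    for x in chunk:
        bx, by = coords[best_matching_unit(codebook, x)]
        for k, (cx, cy) in enumerate(coords):
            # Gaussian neighborhood on the grid (an assumption of this sketch).
            h = math.exp(-((cx - bx) ** 2 + (cy - by) ** 2) / (2.0 * radius ** 2))
            den[k] += h
            for d in range(dim):
                num[k][d] += h * x[d]
    return num, den


def merge(a, b):
    """Reduce two partial results by element-wise addition."""
    (num_a, den_a), (num_b, den_b) = a, b
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(num_a, num_b)]
    den = [p + q for p, q in zip(den_a, den_b)]
    return num, den


def batch_epoch(codebook, coords, data, radius, n_chunks=1):
    """One batch epoch; each chunk is independent and could run on its own worker."""
    total = None
    for i in range(n_chunks):
        part = partial_sums(codebook, coords, data[i::n_chunks], radius)
        total = part if total is None else merge(total, part)
    num, den = total
    return [[v / den[k] for v in num[k]] if den[k] > 0 else list(codebook[k])
            for k in range(len(codebook))]


# Toy data: a 4 x 5 map on 200 random three-dimensional instances.
rng = random.Random(0)
rows, cols, dim = 4, 5, 3
coords = [(c, r) for r in range(rows) for c in range(cols)]
codebook = [[rng.random() for _ in range(dim)] for _ in coords]
data = [[rng.random() for _ in range(dim)] for _ in range(200)]

serial = batch_epoch(codebook, coords, data, radius=1.5, n_chunks=1)
chunked = batch_epoch(codebook, coords, data, radius=1.5, n_chunks=4)
max_diff = max(abs(a - b) for ra, rb in zip(serial, chunked) for a, b in zip(ra, rb))
```

Splitting the data into four chunks yields the same codebook as a single pass, up to floating-point summation order. In this sketch the chunks are processed in a loop; in a multicore or multi-node setting each chunk would be handled by a separate thread or process, with `merge` playing the role of the reduction step.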
