Abstract
Somoclu is a massively parallel tool, written in C++, for training self-organizing maps on large data sets. It builds on OpenMP for multicore execution, and on MPI for distributing the workload across the nodes in a cluster. It can also accelerate training with CUDA if graphics processing units are available. A sparse kernel is included, which is useful for high-dimensional but sparse data, such as the vector spaces common in text mining workflows. Python, R and MATLAB interfaces facilitate interactive use. Apart from fast execution, memory use is highly optimized, enabling the training of large emergent maps even on a single computer.
Highlights
Visual inspection of data is crucial to gain an intuition of the underlying structures
We tested an emergent map of 200 × 200 nodes, with the number of training instances ranging from 1,250 to 10,000
Emergent maps in the package kohonen are not possible, as the map is initialized with a sample from the data instances
Summary
Visual inspection of data is crucial to gain an intuition of the underlying structures. Self-organizing maps (SOMs) are a widespread visualization tool that embeds high-dimensional data on a two-dimensional surface, typically a section of a plane or a torus, while preserving the local topological layout of the original data [9]. Tools exist that scale to large data sets using cluster resources [18], and that combine GPU-accelerated nodes in clusters [27]. Popular languages used in data analytics all have SOM modules, including MATLAB [24], Python [6], and R [25]. Common to these tools is that they seldom make use of parallel computing capabilities, although the batch formulation of SOM training invites such implementations. Distributing the workload across multiple nodes is an extension of the parallel formulation (Section 3.2).
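To illustrate why the batch formulation lends itself to parallel execution, the following is a minimal pure-Python sketch of one batch epoch. It is not Somoclu's implementation: the function names (`best_matching_unit`, `batch_epoch`, `train`), the Gaussian neighborhood, the planar grid, and the linearly shrinking radius are all assumptions chosen for illustration. The point it shows is that the search for the best matching unit is independent for every data instance, so that step can be split across threads or nodes before the weights are recomputed.

```python
import math
import random


def best_matching_unit(codebook, x):
    """Return the index of the node whose weight vector is closest to x."""
    best, best_dist = 0, float("inf")
    for j, w in enumerate(codebook):
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_dist:
            best, best_dist = j, d
    return best


def batch_epoch(codebook, data, n_columns, radius):
    """One batch epoch on a planar grid (assumed Gaussian neighborhood)."""
    dim = len(data[0])
    # Step 1: BMU search. Each instance is handled independently,
    # which is the part that parallelizes naturally.
    bmus = [best_matching_unit(codebook, x) for x in data]

    # Step 2: each node becomes the neighborhood-weighted mean of the data.
    new_codebook = []
    for j in range(len(codebook)):
        jr, jc = divmod(j, n_columns)
        numerator = [0.0] * dim
        denominator = 0.0
        for x, b in zip(data, bmus):
            br, bc = divmod(b, n_columns)
            grid_dist2 = (jr - br) ** 2 + (jc - bc) ** 2
            h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            denominator += h
            for k in range(dim):
                numerator[k] += h * x[k]
        new_codebook.append([v / denominator for v in numerator])
    return new_codebook


def train(data, n_rows, n_columns, epochs=10, seed=0):
    """Train a small map; random initialization is an illustrative choice."""
    rng = random.Random(seed)
    dim = len(data[0])
    codebook = [[rng.random() for _ in range(dim)]
                for _ in range(n_rows * n_columns)]
    start_radius = max(n_rows, n_columns) / 2.0
    for epoch in range(epochs):
        # Shrink the radius linearly from start_radius toward 1.
        radius = start_radius + (1.0 - start_radius) * epoch / max(epochs - 1, 1)
        codebook = batch_epoch(codebook, data, n_columns, radius)
    return codebook
```

In this sketch, a call such as `train(data, 3, 3)` returns nine weight vectors. Because every update is a weighted mean of the training instances, each weight stays inside the range spanned by the data.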