Abstract

Clustering algorithms are a major topic in big data analysis. Their main goal is to separate an unlabelled dataset into several subsets, with each subset ideally characterized by some unique property of its data structure. Common clustering approaches cannot impose constraints on cluster sizes. However, in many applications, cluster sizes are bounded or known in advance. One of the more recent robust clustering algorithms is neural gas, which is popular, for example, for data compression and vector quantization in speech recognition and signal processing. In this paper, we introduce an adapted neural gas algorithm able to accommodate requirements on cluster sizes. The convergence of the algorithm towards an optimum is tested on simple illustrative examples. The proposed algorithm provides better statistical results than its direct counterpart, the balanced k-means algorithm; moreover, unlike balanced k-means, the quality of its results can be straightforwardly controlled by user-defined parameters.

Highlights

  • The amount of data in various disciplines, ranging from bioinformatics to web documents, increases nonlinearly each year

  • In the first learning iteration, no assignment of observations to clusters is available yet, so the adaptation of the vectors assigned to the centers remains the same as in the original neural gas algorithm, described by equations (1) and (2)

  • We sort the centers by their distance from the selected point in the same way as in the classical neural gas algorithm, but we then move the center to which the currently selected point was assigned in the previous iteration to the front of the sequence
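
The ranking step described in the highlights above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names, the `prev_assigned` argument, and the parameters `eps` (step size) and `lam` (neighbourhood range) are hypothetical. The update rule is the textbook neural gas form; the paper's equations (1) and (2) are not reproduced on this page and may differ in detail, for instance in how the parameters decay over time.

```python
import math
from typing import List, Optional, Sequence


def rank_centers(point: Sequence[float],
                 centers: Sequence[Sequence[float]],
                 prev_assigned: Optional[int] = None) -> List[int]:
    """Return center indices in the order used for the update."""
    # Classical neural gas ranking: nearest center first.
    order = sorted(range(len(centers)),
                   key=lambda i: math.dist(point, centers[i]))
    # Modification described in the text: the center this point was
    # assigned to in the previous iteration moves to the front.
    # With prev_assigned=None (first iteration) the order is unchanged.
    if prev_assigned is not None:
        order.remove(prev_assigned)
        order.insert(0, prev_assigned)
    return order


def neural_gas_step(point: Sequence[float],
                    centers: Sequence[Sequence[float]],
                    order: Sequence[int],
                    eps: float = 0.5,
                    lam: float = 1.0) -> List[List[float]]:
    """Textbook update (assumed form): w_i += eps * exp(-k_i / lam) * (x - w_i),
    where k_i is the position of center i in `order`."""
    updated = [list(c) for c in centers]
    for rank, i in enumerate(order):
        h = eps * math.exp(-rank / lam)
        updated[i] = [w + h * (x - w) for w, x in zip(centers[i], point)]
    return updated


centers = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
point = [0.9, 0.0]
plain_order = rank_centers(point, centers)                      # first iteration
adapted_order = rank_centers(point, centers, prev_assigned=2)   # later iterations
new_centers = neural_gas_step(point, centers, adapted_order)
```

Moving the previously assigned center to rank 0 gives it the largest step towards the point, regardless of whether it is currently the nearest center. How the size-constrained assignment itself is computed is not covered by this sketch.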


Summary

Introduction

The amount of data in various disciplines, ranging from bioinformatics to web documents, increases nonlinearly each year. To exploit these data and extract knowledge from them, effective processing is necessary. Cluster analysis, together with clustering algorithms, is a major topic in big data analysis. The goal of unsupervised clustering as a data mining task is to separate an unlabelled dataset of “observations” into several sets, where each set is ideally characterized by its unique hidden data structure. Since the definition of the principle underlying such a data structure is subjective, there is no single best clustering algorithm or best definition of a cluster. Major approaches to clustering include hierarchical, partitional, neural network-based, and kernel-based clustering [1].
