Abstract

Computational tools in modern data analysis must be scalable to satisfy business and research time constraints. In this regard, two alternatives are possible: (i) adapt existing algorithms, or design new ones, so that they can run in a distributed computing environment, or (ii) develop model-based learning techniques that can be trained efficiently on a small subset of the data and still make reliable predictions. In this chapter, two recent algorithms, one for each direction, are reviewed. In particular, the first part describes a scalable in-memory spectral clustering algorithm. This technique relies on a kernel-based formulation of the spectral clustering problem, also known as kernel spectral clustering. More precisely, a finite-dimensional approximation of the feature map, obtained via the Nyström method, is used to solve the primal optimization problem, which reduces the computational complexity from cubic to linear in the number of data points. The second part illustrates a distributed clustering approach with a fixed computational budget. This method extends the k-means algorithm by applying regularization at the level of the prototype vectors. An optimal stochastic gradient descent scheme for learning with \(l_1\) and \(l_2\) norms is utilized, which makes the approach less sensitive to the influence of outliers while computing the prototype vectors.
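To make the first idea concrete, the following is a minimal sketch of spectral clustering with a Nyström-approximated feature map. It is not the chapter's exact kernel spectral clustering primal formulation: it assumes an RBF kernel, uniformly sampled landmarks, and a k-means step in the approximate embedding in place of the KSC decision rule. With \(m\) landmarks the cost is \(O(nm^2)\), i.e. linear in the number of points \(n\).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import KMeans

def nystrom_spectral_clustering(X, n_clusters, n_landmarks=200, gamma=1.0, seed=0):
    """Approximate spectral clustering via a Nystrom feature map (sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(n_landmarks, n), replace=False)
    landmarks = X[idx]

    W = rbf_kernel(landmarks, landmarks, gamma=gamma)  # m x m landmark kernel block
    C = rbf_kernel(X, landmarks, gamma=gamma)          # n x m cross-kernel

    # Eigendecomposition of the small block gives the finite-dimensional
    # approximate feature map  Z = C U diag(1/sqrt(lambda)).
    lam, U = np.linalg.eigh(W)
    keep = lam > 1e-10                                 # drop near-null directions
    Z = C @ U[:, keep] / np.sqrt(lam[keep])

    # Normalize rows and cluster in the approximate spectral embedding.
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
```

For the second part, the sketch below illustrates stochastic-gradient k-means with \(l_1\) and \(l_2\) regularization applied to the prototype vectors. It is an assumed, generic update scheme (one point per step, a per-prototype \(1/t\) learning-rate decay, \(l_2\) shrinkage plus an \(l_1\) soft-thresholding step), not the optimal scheme derived in the chapter; it only shows how prototype-level regularization damps the pull of outliers.

```python
import numpy as np

def regularized_sgd_kmeans(X, n_clusters, n_iter=10000, lr0=0.5,
                           l1=0.0, l2=1e-3, seed=0):
    """Stochastic k-means with l1/l2 shrinkage of the prototypes (sketch)."""
    rng = np.random.default_rng(seed)
    protos = X[rng.choice(X.shape[0], n_clusters, replace=False)].copy()
    counts = np.ones(n_clusters)                        # per-prototype step counters

    for _ in range(n_iter):
        x = X[rng.integers(X.shape[0])]
        j = np.argmin(np.linalg.norm(protos - x, axis=1))  # winning prototype
        eta = lr0 / counts[j]
        counts[j] += 1
        # Gradient step on 0.5*||x - w||^2 + 0.5*l2*||w||^2 ...
        protos[j] += eta * ((x - protos[j]) - l2 * protos[j])
        # ... followed by proximal soft-thresholding for the l1 term.
        protos[j] = np.sign(protos[j]) * np.maximum(np.abs(protos[j]) - eta * l1, 0.0)

    labels = np.argmin(np.linalg.norm(X[:, None, :] - protos[None], axis=2), axis=1)
    return protos, labels
```

Because each step touches a single point and a single prototype, the per-iteration cost is independent of the dataset size, which is what allows a fixed computational budget in a distributed setting.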
