Abstract

Mean Shift clustering, a generalization of the well-known k-means clustering, computes arbitrarily shaped clusters, defined as the basins of attraction to the local modes reached by density gradient ascent paths. Despite its potential for improved clustering accuracy, the Mean Shift approach is a computationally expensive method for unsupervised learning. We introduce two contributions aiming to provide approximate Mean Shift clustering, based on scalable procedures to compute the density gradient ascent and cluster labeling, with linear time complexity, as opposed to the quadratic time complexity of the exact clustering. Both proposals rely on Locality-Sensitive Hashing (LSH) to compute approximate nearest neighbors. When implemented on a serial system, these approximate methods can be used for moderate-sized datasets. To facilitate the analysis of Big Data, a distributed implementation, written for the Spark/Scala ecosystem, is proposed. An added benefit is that our proposed approximations of the density gradient ascent, when used as a pre-processing step in other clustering methods, can also improve the clustering accuracy of the latter. We present experimental results illustrating the effect of tuning parameters on cluster labeling accuracy and execution times, as well as the potential to solve concrete problems in Big Data clustering.
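
To make the core update concrete, the sketch below shows one density gradient ascent step, in which a point is shifted toward the kernel-weighted mean of its approximate nearest neighbors; clusters are then the groups of points whose ascent paths converge to the same mode. This is a minimal serial illustration under stated assumptions, not the paper's implementation: the annQuery function stands in for an LSH-based approximate nearest-neighbor lookup, and the Gaussian kernel and bandwidth h are assumed choices.

object ApproxMeanShiftSketch {

  type Point = Array[Double]

  // Gaussian kernel weight for a squared distance d2 and bandwidth h (assumed kernel choice).
  def gaussianWeight(d2: Double, h: Double): Double =
    math.exp(-d2 / (2.0 * h * h))

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One gradient ascent step: shift x toward the kernel-weighted mean of its
  // approximate nearest neighbors returned by annQuery (e.g. an LSH index lookup).
  def shiftOnce(x: Point, h: Double, annQuery: Point => Seq[Point]): Point = {
    val neighbors = annQuery(x)
    val weights   = neighbors.map(n => gaussianWeight(squaredDistance(x, n), h))
    val total     = weights.sum
    Array.tabulate(x.length) { i =>
      neighbors.zip(weights).map { case (n, w) => w * n(i) }.sum / total
    }
  }
}

In the full procedure this step would be iterated until the shift falls below a tolerance, and points whose paths end at the same mode receive the same cluster label; replacing the exact neighbor search with the LSH approximation is what reduces the quadratic cost toward a linear one.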
