Clustering is a fundamental unsupervised machine learning problem. However, limited memory and CPU power on a single machine make it challenging to cluster large-scale data. Distributed methods have received great attention in recent years, since large-scale data can be stored and processed on multiple machines. In this paper, we study a variant of the k-center clustering problem, namely the lower-bounded k-center clustering problem (denoted as the Lb-k-Cen problem), in the Massively Parallel Computation (MPC) distributed model. The current best distributed result for the Lb-k-Cen problem requires several rounds of communication between the coordinator and the machines, which may increase the local computation and communication cost of the algorithm when handling large-scale data. To reduce the local computation and the number of communication rounds, we use the threshold method and a flow network technique, which avoid repeating local computation on each machine, and obtain a two-round (9+ϵ)-approximation algorithm in general metric spaces. Moreover, we consider distributed algorithms for the Lb-k-Cen problem in metric spaces with bounded doubling dimension, and propose a two-round (3+ϵ)-approximation algorithm.
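For readers unfamiliar with the threshold method named above, the following is a minimal single-machine sketch of the classical threshold technique for plain k-center. It is not the algorithm of this paper: it ignores the lower-bound constraint, the flow network step, and the MPC setting, and the function names `greedy_cover` and `threshold_k_center` are hypothetical. The idea is to guess a radius τ among the pairwise distances and greedily open a center at any point not yet within 2τ of an existing center; the smallest guess that needs at most k centers yields a 2-approximation for plain k-center.

```python
from itertools import combinations
from math import dist


def greedy_cover(points, k, tau):
    """Greedily pick centers so every point is within 2*tau of one.

    Returns the list of centers, or None if more than k would be needed.
    Illustrative helper; not taken from the paper.
    """
    centers = []
    uncovered = list(points)
    while uncovered:
        if len(centers) == k:
            return None  # threshold tau is too small for k centers
        c = uncovered[0]
        centers.append(c)
        # Keep only the points that the new center does not cover.
        uncovered = [p for p in uncovered if dist(p, c) > 2 * tau]
    return centers


def threshold_k_center(points, k):
    """Try candidate thresholds in increasing order; return the first that works.

    Candidates are 0 and all pairwise distances, since the optimal
    k-center radius (with centers chosen among the points) is one of them.
    """
    candidates = sorted({0.0} | {dist(p, q) for p, q in combinations(points, 2)})
    for tau in candidates:
        centers = greedy_cover(points, k, tau)
        if centers is not None:
            return tau, centers
    return None  # only reachable if points is non-empty and k < 1


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    tau, centers = threshold_k_center(pts, k=2)
    print(tau, centers)
```

The linear scan over candidates keeps the sketch easy to follow; a binary search over the sorted candidates is the usual optimization. Handling the lower bound would additionally require checking that points can be assigned so that every opened center receives enough of them, which is where a flow-based assignment would come in.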