Particle swarm optimization for large-scale clustering on Apache Spark

Matthew Sherar, Farhana Zulkernine

doi:10.1109/ssci.2017.8285208

Abstract

We present a particle swarm optimization (PSO) clustering algorithm implemented in Apache Spark to achieve parallel big data clustering. Apache Spark is an in-memory big data analytics framework that uses parallel distributed processing to analyze large amounts of data faster than most other existing data analytics tools. Spark's library of data analytics functions does not include the PSO algorithm. PSO is an evolutionary computing technique that has been shown to produce more compact clusters than other partitional clustering techniques for a wide range of data. In addition, PSO is a parallelizable and customizable algorithm well suited to multi-objective clustering problems. In this paper, we present our implementation of a hybrid K-Means PSO (KMPSO) clustering algorithm in Apache Spark and demonstrate the performance gained in Spark by comparing our implementation with an implementation of KMPSO in MATLAB. We demonstrate that KMPSO can produce better clustering results than Spark's built-in clustering algorithms, and that Apache Spark enables efficient scaling of resources to handle large and complex workloads.

Full Text