A Novel Clustering Algorithm on Large-Scale Graph Data

Hao Zhang,Ge Fu,Wei Zhou,Zhiyong Xu,Jizhong Han,Xiaoyu Wan

doi:10.1109/ccbd.2014.23

Abstract

Large scale Social Network Service (SNS) applications use graph as the default data organization model. To discover hidden useful information, data clustering techniques are widely adopted in data analysis tasks. State-of-the-art clustering algorithms are based on distributed programming paradigms such as MapReduce and Bulk Synchronization Parallel (BSP) models. These solutions suffer from heavy data transmission cost as well as excessive storage overhead problems. Thus, they can hardly achieve the optimal performance. In this paper, we propose a novel clustering algorithm based on pagerank to relieve these issues. Compare to current systems, our algorithm can achieve similar clustering results on large-scale graph data with much lower network and storage overhead. It consists of three steps. First, it calculates the pagerank values for each vertex based on the underlying data set. Then, we select multiple vertexes from the list as clustering centers. Third, our algorithm expands the range of the cluster by adding more neighboring vertexes iteratively. We compare our algorithm with other popular clustering algorithms using real world data sets and the results show that it achieves better performance than any other distributed solutions. For example, on a web graph data set with hundreds of thousands of vertexes and millions of edges, the time and memory consumption can be reduced by more than 95% and 70%, respectively.

Full Text