Abstract

Large-scale Social Network Service (SNS) applications use graphs as their default data organization model. To discover hidden, useful information in such data, clustering techniques are widely adopted in data analysis tasks. State-of-the-art clustering algorithms are based on distributed programming paradigms such as MapReduce and the Bulk Synchronous Parallel (BSP) model. These solutions suffer from heavy data transmission costs and excessive storage overhead, and therefore can hardly achieve optimal performance. In this paper, we propose a novel clustering algorithm based on PageRank to relieve these issues. Compared to current systems, our algorithm achieves similar clustering results on large-scale graph data with much lower network and storage overhead. It consists of three steps. First, it calculates the PageRank value of each vertex in the underlying data set. Second, it selects multiple vertices from the resulting list as cluster centers. Third, it expands each cluster iteratively by adding neighboring vertices. We compare our algorithm with other popular clustering algorithms on real-world data sets, and the results show that it achieves better performance than the other distributed solutions. For example, on a web graph data set with hundreds of thousands of vertices and millions of edges, time and memory consumption are reduced by more than 95% and 70%, respectively.
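
The three steps can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract does not specify how centers are chosen or how expansion resolves conflicts, so the sketch assumes that the top-k vertices by PageRank become centers and that clusters grow one hop per iteration, with an unassigned vertex joining whichever cluster reaches it first. The function names (`pagerank`, `select_centers`, `expand_clusters`), the damping factor, and the iteration count are all hypothetical choices made for illustration, and the graph is assumed to be undirected (a symmetric adjacency list).

```python
def pagerank(adj, damping=0.85, iters=50):
    """Step 1: power-iteration PageRank on a dict {vertex: [neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by vertices without out-edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adj[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if adj[v]:
                share = damping * rank[v] / len(adj[v])
                for w in adj[v]:
                    new[w] += share
        rank = new
    return rank


def select_centers(rank, k):
    """Step 2 (assumed rule): take the k highest-ranked vertices as centers."""
    return sorted(rank, key=rank.get, reverse=True)[:k]


def expand_clusters(adj, centers):
    """Step 3 (assumed rule): grow every cluster by one hop per iteration.

    Each vertex is labeled with the center of the first cluster to reach it.
    """
    label = {c: c for c in centers}
    frontier = list(centers)
    while frontier:
        next_frontier = []
        for v in frontier:
            for w in adj[v]:
                if w not in label:
                    label[w] = label[v]
                    next_frontier.append(w)
        frontier = next_frontier
    return label


if __name__ == "__main__":
    # Two triangles joined by the bridge edge 2-3 (toy data, not from the paper).
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    centers = select_centers(pagerank(graph), k=2)
    print(expand_clusters(graph, centers))
```

On this toy graph the two bridge endpoints receive the highest PageRank, so each triangle ends up in its own cluster. Because expansion only touches a vertex's direct neighbors in each iteration, a sketch of this kind hints at why such a scheme might need less data movement than a general MapReduce or BSP clustering job, although the abstract itself does not detail the mechanism.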
