Abstract

Graph clustering is an important technique for understanding the relationships between the vertices in a big graph. In this paper, we propose a novel random-walk-based graph clustering method. The proposed method restricts the reach of the walking agent using an inflation function and a normalization function. We analyze the behavior of the limited random walk procedure and propose a novel algorithm for both global and local graph clustering problems. Previous random-walk-based algorithms depend on the chosen fitness function to find the clusters around a seed vertex. The proposed algorithm tackles the problem in an entirely different manner. We use the limited random walk procedure to find attractor vertices in a graph and use them as features to cluster the vertices. According to the experimental results on simulated graph data and real-world big graph data, the proposed method is superior to the state-of-the-art methods in solving graph clustering problems. Since the proposed method follows the embarrassingly parallel paradigm, it can be efficiently implemented and embedded in any parallel computing environment, such as a MapReduce framework. Given enough computing resources, we are capable of clustering graphs with millions of vertices and hundreds of millions of edges in a reasonable time.
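To make the idea of a walk restricted by inflation and normalization more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the inflation function is an element-wise power with exponent r > 1, that normalization rescales the probabilities to sum to one, and that every vertex carries a self-loop; the function name `limited_random_walk` and its parameters are hypothetical.

```python
def limited_random_walk(adj, seed, r=2.0, max_steps=100, tol=1e-10):
    """Sketch of a random walk whose reach is limited by inflation.

    adj  : dict mapping each vertex to a list of its neighbours (undirected)
    seed : vertex where the walking agent starts
    r    : inflation exponent (assumed form: element-wise power, r > 1)
    Returns a dict mapping vertices to their final probability.
    """
    # Assumption: add a self-loop so the agent may also stay where it is.
    nbrs = {v: set(ns) | {v} for v, ns in adj.items()}
    x = {seed: 1.0}
    for _ in range(max_steps):
        # 1) Ordinary random-walk step: spread mass uniformly over neighbours.
        y = {}
        for v, p in x.items():
            share = p / len(nbrs[v])
            for u in nbrs[v]:
                y[u] = y.get(u, 0.0) + share
        # 2) Inflation: boost large probabilities, suppress small ones.
        y = {v: p ** r for v, p in y.items()}
        # 3) Normalization: rescale so the vector is a distribution again.
        total = sum(y.values())
        y = {v: p / total for v, p in y.items()}
        # Stop once the distribution no longer changes noticeably.
        change = max(abs(y[v] - x.get(v, 0.0)) for v in y)
        x = y
        if change < tol:
            break
    return x


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
probs = limited_random_walk(graph, seed=0)
mass_in_seed_triangle = sum(probs.get(v, 0.0) for v in (0, 1, 2))
```

On this toy graph, most of the probability mass should stay inside the triangle that contains the seed, which is the sense in which the walk is "limited". Because each seed is processed independently of the others, walks from different seeds could be run in parallel.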

Highlights

  • Graph data are important data types in many scientific areas, such as social network analysis, bioinformatics, and computer and information network analysis [1]

  • We analyze the behavior of the limited random walk procedure and propose a novel algorithm for both global and local graph clustering problems

  • We propose a novel random-walk-based graph clustering algorithm—the limited random walk (LRW) algorithm

Summary

Background

Graph data are important data types in many scientific areas, such as social network analysis, bioinformatics, and computer and information network analysis [1]. Graph clustering (also called “community detection” in the literature) algorithms aim to reveal the heterogeneity and find the underlying relations between vertices [2]. This technique is critical for understanding the properties, predicting the dynamic behavior, and improving the visualization of big graph data. Newman defined a modularity measurement based on the probability of a link between any two vertices. He applied a greedy search method to maximize this modularity fitness function in order to partition a graph into clusters [5]. The accuracy of any criteria-based clustering method (including those combined with random walk procedures) is greatly affected by the chosen clustering fitness function. Most local clustering algorithms use criteria that are more suitable for the global graph clustering problem. These choices greatly degrade the performance of these algorithms when the graph is big and highly uneven. The rest of the paper is organized as follows: the basics of the random walk procedure and the proposed LRW algorithm are explained in the "Methodology" section; an extensive set of experiments on simulated and real graph data, along with both numerical and visual evaluations, is given in the "Experiments" section; the conclusions and future work are discussed in the "Conclusions" section.
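As an illustration of the kind of fitness function mentioned above, the following sketch computes the standard Newman-Girvan modularity of a given partition of an undirected, unweighted graph. It is written per community as Q = Σ_c [e_c / m − (d_c / 2m)²], where m is the number of edges, e_c the number of edges inside community c, and d_c the total degree of its vertices. This is the textbook definition and may differ in detail from the formulation in [5]; the function name `modularity` and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition (textbook form).

    edges     : list of (u, v) pairs, each undirected edge listed once
    community : dict mapping every vertex to its community label
    """
    m = len(edges)
    inside = {}   # e_c: number of edges with both ends in community c
    degree = {}   # d_c: sum of the degrees of the vertices in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    # Q = sum over communities of (fraction of internal edges
    #     minus the fraction expected if edges were placed at random).
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "a" for v in range(6)}
q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

A greedy search of the kind described would repeatedly merge or move vertices whenever doing so increases this score. Here the two-triangle partition scores higher than placing all vertices in a single cluster.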
