Efficient SimRank-Based Similarity Join

Weiguo Zheng, Lei Chen, Lei Zou, Dongyan Zhao

doi:10.1145/3083899

Abstract

Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is useful for tasks such as friend recommendation in social networks and link prediction. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that “SimRank” is purely structure dependent and does not rely on domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs, drawn from two sets of vertices U and V, whose SimRank scores satisfy a given threshold. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification phase, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we materialize the SimRank scores of only a small proportion of vertex pairs (i.e., the h-go cover+ vertex pairs), from which the SimRank score of any vertex pair can be computed easily. To find the h-go cover+ vertex pairs, we propose an efficient method that does not build the vertex-pair graph. Hence, large graphs can be handled easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.
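To make the SRJ query concrete, the following is a minimal brute-force sketch in Python. It is not the method proposed in this article: it applies no shortest-path-distance pruning and uses no h-go cover+ index, and simply iterates the standard SimRank recurrence over all vertex pairs before filtering U × V by the threshold. The function names, the decay factor c = 0.6, the iteration count, the toy graph, and the reading of "satisfying the threshold" as "score at least the threshold" are all assumptions made for illustration.

```python
from itertools import product


def simrank(in_neighbors, c=0.6, iterations=10):
    """Naive iterative SimRank over all vertex pairs.

    in_neighbors maps every vertex to the list of its in-neighbors.
    c (decay factor) and iterations are illustrative defaults.
    """
    nodes = list(in_neighbors)
    # Initial scores: 1 for identical vertices, 0 otherwise.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A vertex without in-neighbors has similarity 0 to any other vertex.
                new[(a, b)] = 0.0
                continue
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


def simrank_join(in_neighbors, U, V, threshold, c=0.6, iterations=10):
    """Brute-force SRJ: all (u, v) in U x V with SimRank score >= threshold."""
    sim = simrank(in_neighbors, c, iterations)
    return sorted((u, v) for u in U for v in V if sim[(u, v)] >= threshold)


# Hypothetical toy graph: a -> b, a -> c, b -> d, c -> e.
toy_graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"]}
result = simrank_join(toy_graph, U=["b", "d"], V=["c", "e"], threshold=0.3)
print(result)
```

This baseline computes a score for every vertex pair in the graph, which is the cost that the pruning bound and the index described in the abstract are intended to avoid on large graphs.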

Full Text