Abstract

Given a graph, how can we quantify similarity between two nodes in an effective and scalable way? SimRank is an attractive measure of pairwise similarity based on graph topologies. Its underpinning philosophy that “two nodes are similar if they are pointed to (have incoming edges) from similar nodes” can be regarded as an aggregation of similarities based on incoming paths. Despite its popularity in various applications (e.g., web search and social networks), SimRank has an undesirable trait, i.e., “zero-similarity”: it accommodates only the paths of equal length from a common “center” node, whereas a large portion of other paths are fully ignored. In this paper, we propose an effective and scalable similarity model, SimRank*, to remedy this problem. (1) We first provide a sufficient and necessary condition for the “zero-similarity” problem that exists in Jeh and Widom’s SimRank model, Li et al.’s SimRank model, Random Walk with Restart (RWR), and ASCOS++. (2) We next present our treatment, SimRank*, which can resolve this issue while inheriting the merit of the simple SimRank philosophy. (3) We reduce the series form of SimRank* to a closed form, which looks simpler than SimRank but which enriches semantics without suffering from increased computational overhead. This leads to an iterative form of SimRank*, which requires $O(Knm)$ time and $O(n^2)$ memory for computing all $n^2$ pairs of similarities on a graph of $n$ nodes and $m$ edges for $K$ iterations. (4) To improve the computational time of SimRank* further, we leverage a novel clustering strategy via edge concentration. Because the edge concentration problem is NP-hard, we devise an efficient heuristic to speed up all-pairs SimRank* computation to $O(Kn\tilde{m})$ time, where $\tilde{m}$ is generally much smaller than $m$.
(5) To scale SimRank* on billion-edge graphs, we propose two memory-efficient single-source algorithms, i.e., ss-gSR* for geometric SimRank*, and ss-eSR* for exponential SimRank*, which can retrieve similarities between all $n$ nodes and a given query on an as-needed basis. This significantly reduces the $O(n^2)$ memory of all-pairs search to either $O(Kn + \tilde{m})$ for geometric SimRank*, or $O(n + \tilde{m})$ for exponential SimRank*, without any loss of accuracy, where $\tilde{m} \ll n^2$. (6) We also compare SimRank* with another remedy of SimRank that adds self-loops on each node and demonstrate that SimRank* is more effective. (7) Using real and synthetic datasets, we empirically verify the richer semantics of SimRank*, and validate its high computational efficiency and scalability on large graphs with billions of edges.
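
The “zero-similarity” trait described above can be reproduced on a three-node path graph. The sketch below is an illustration under assumptions that are not stated in this abstract: it uses the commonly cited matrix form of Li et al.'s SimRank, S = C·Q·S·Qᵀ + (1 − C)·I, and it assumes that the recursive form of geometric SimRank* is S = (C/2)·(Q·S + S·Qᵀ) + (1 − C)·I, which should be checked against the full paper. Q is the row-normalised backward transition matrix; the function names, the damping factor C = 0.6, and K = 20 iterations are hypothetical choices.

```python
def backward_transition(n, edges):
    """Q[i][j] = 1/|I(i)| if there is an edge j -> i, else 0 (I(i) = in-neighbours of i)."""
    in_nbrs = [[] for _ in range(n)]
    for src, dst in edges:
        in_nbrs[dst].append(src)
    Q = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(in_nbrs):
        for j in nbrs:
            Q[i][j] = 1.0 / len(nbrs)
    return Q


def matmul(A, B):
    """Dense matrix product on nested lists (fine for toy graphs only)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def simrank_li(Q, C=0.6, K=20):
    """Assumed matrix form of Li et al.'s SimRank: S = C*Q*S*Q^T + (1-C)*I."""
    n = len(Q)
    Qt = [list(col) for col in zip(*Q)]
    S = [[(1 - C) if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(K):
        QSQt = matmul(matmul(Q, S), Qt)
        S = [[C * QSQt[i][j] + ((1 - C) if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return S


def simrank_star_geometric(Q, C=0.6, K=20):
    """Assumed recurrence for geometric SimRank*: S = C/2*(Q*S + S*Q^T) + (1-C)*I."""
    n = len(Q)
    S = [[(1 - C) if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(K):
        QS = matmul(Q, S)
        # S stays symmetric, so S*Q^T equals the transpose of Q*S:
        # one matrix product per iteration is enough.
        S = [[C / 2 * (QS[i][j] + QS[j][i]) + ((1 - C) if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    return S


# Path graph 0 -> 1 -> 2: nodes 1 and 2 are linked to node 0 only by
# in-link paths of unequal length (1 and 2).
Q = backward_transition(3, [(0, 1), (1, 2)])
print("SimRank(1, 2)  =", simrank_li(Q)[1][2])
print("SimRank*(1, 2) =", simrank_star_geometric(Q)[1][2])
```

On this path graph the SimRank score of the pair (1, 2) stays exactly zero, because no pair of equal-length in-link paths meets at a common node, while the assumed SimRank* recurrence gives the pair a positive score.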

Highlights

  • The task of assessing similarity between two nodes based on graph topology is a long-standing problem. Recently, SimRank [12] has received growing interest as a widely-accepted measure of pairwise similarity

  • (1) We first provide a sufficient and necessary condition of the “zero-similarity” problem that exists in Jeh and Widom’s SimRank model, Li et al.’s SimRank model, Random Walk with Restart (RWR), and ASCOS++. (2) We present our treatment, SimRank*, which can resolve this issue while inheriting the merit of the simple SimRank philosophy

  • Because the edge concentration problem is NP-hard, an efficient heuristic algorithm is devised to speed up all-pairs SimRank* computation to $O(Kn\tilde{m})$ time, where $\tilde{m}$ is the number of edges in our compressed graph, which is generally much smaller than $m$ (Sect. 6)

  • To scale SimRank* over billion-edge graphs, we propose two memory-efficient single-source algorithms for SimRank*, i.e., ss-gSR* for geometric SimRank* and ss-eSR* for exponential SimRank*, that require $O(K^2\tilde{m})$ time and $O(K\tilde{m})$ time, respectively, to compute similarities between all $n$ nodes and a given query on an as-needed basis
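
The edge-count saving behind the compressed graph mentioned in the highlights can be illustrated with a toy sketch. It does not reproduce the paper's heuristic. It assumes a much simpler setting in which target nodes are grouped only when their in-neighbour sets are identical, and each group is routed through one hypothetical hub node, so a biclique with |X| sources and |Y| targets costs |X| + |Y| edges instead of |X|·|Y|. The function name and the example graph are made up for illustration.

```python
from collections import defaultdict


def compress_identical_in_sets(edges):
    """Group targets that share exactly the same in-neighbour set.

    Returns (groups, compressed_edge_count), where groups maps each shared
    in-neighbour set to its targets, and compressed_edge_count is the number
    of edges left if every group uses a hub whenever that is cheaper.
    """
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        in_nbrs[dst].add(src)

    groups = defaultdict(list)
    for dst, srcs in in_nbrs.items():
        groups[frozenset(srcs)].append(dst)

    compressed = 0
    for srcs, dsts in groups.items():
        direct = len(srcs) * len(dsts)   # every source linked to every target
        via_hub = len(srcs) + len(dsts)  # sources -> hub -> targets
        compressed += min(direct, via_hub)
    return dict(groups), compressed


# Three sources that each point to the same four targets: a 3 x 4 biclique.
edges = [(s, t) for s in ("a", "b", "c") for t in ("w", "x", "y", "z")]
groups, m_tilde = compress_identical_in_sets(edges)
print(len(edges), "edges before,", m_tilde, "edges after compression")
```

The same grouping suggests why compression helps similarity computation: a sum over a shared in-neighbour set can be computed once at the hub and reused by every target in the group.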


Summary

Introduction

SimRank [12] has received growing interest as a widely-accepted measure of pairwise similarity. As demonstrated by our experiments, both forms of the “zero-similarity” issue commonly exist in real graphs: on CitH, for example, ∼97.9% of node-pairs have “zero-SimRank” issues, comprising ∼19.2% that are evaluated to be “completely dissimilar” and ∼78.7% that (though SimRank ≠ 0) are “partially missing” the contributions of many in-link paths. These issues adversely affect assessment quality, which highlights the need to enhance the existing SimRank model. Single-source queries, which retrieve the similarities between a given query node and all other nodes, are practically useful for answering questions such as “who has close interactions with Diego (query) in a social network?” and “which papers are relevant to this one (query) in a co-citation graph?”

Main contributions
Jeh and Widom’s SimRank model
Preliminaries
Counting in-link paths
Geometric series form of SimRank*
Weighted factors of two types
Convergence of SimRank*
Closed form of exponential SimRank*
Recursive form of geometric SimRank*
Fine-grained memoization
Induced bigraph
Biclique compression via edge concentration
Single-source geometric SimRank*
Single-source exponential SimRank*
Comparison with “adding self-loops”
Experimental settings
Quantitative results on semantic effectiveness
Qualitative case studies on semantics
Link-based similarity measures
Optimization methods for computing similarities
Conclusions