Abstract

Searching and mining in large graphs is critical to a variety of applications, at the core of which is the pattern matching activity. The scalable processing of large graphs requires careful distribution of graphs across clusters. Graph partitioning is the technique that divides a big graph into several non-overlapping subgraphs and assigns each subgraph to a compute node. Traditional workload-agnostic partitioners aim to minimize the number of inter-partition edges using only graph topology, which may not yield the best solution if the workload is skewed. Some workload-aware partitioners mine information from a specific workload and use it to minimize the number of inter-partition traversals during execution; however, their methods are not suitable for pattern matching applications. In this work, we propose a query-sensitive graph partitioner that aims to improve an existing partitioning for a given pattern matching workload. The partitioner takes any initial partitioning as a starting point and iteratively adjusts it by exchanging chosen clusters across partitions, heuristically reducing the probability of inter-partition traversals. We identify a few implementation-independent factors that may increase the traversal probability of an edge and quantify them as a computable indicator using information from query patterns and graph topology. Then, we propose an efficient algorithm to calculate the indicator and implement a graph repartitioner by combining the indicator with a greedy cluster-exchanging mechanism. Finally, we generate a large heterogeneous labeled graph from real-world data crawled from the Netease Music website and evaluate the partitioning quality of our repartitioner with a few meaningful query patterns of common topologies, including line, loop and branching. Compared with hash-based partitioning, our system reduces inter-partition traversals by at least 70%. Compared with the state-of-the-art graph partitioner Metis, our repartitioner reduces inter-partition traversals by at least 50%.
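The difference between minimizing inter-partition edges and minimizing inter-partition traversals can be illustrated with a small sketch. The toy graph, the per-edge probabilities, and the function names `edge_cut` and `expected_cut_traversals` below are assumptions made for illustration only; they are not the indicator or the data used in the paper.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def edge_cut(etp: Dict[Edge, float], part: Dict[Hashable, int]) -> int:
    """Count edges whose endpoints are assigned to different partitions."""
    return sum(1 for (u, v) in etp if part[u] != part[v])


def expected_cut_traversals(etp: Dict[Edge, float], part: Dict[Hashable, int]) -> float:
    """Sum the (assumed) traversal probabilities of all inter-partition edges."""
    return sum(p for (u, v), p in etp.items() if part[u] != part[v])


# Hypothetical 4-cycle a-b-c-d-a with made-up traversal probabilities:
# the workload mostly traverses a-b and c-d.
etp = {("a", "b"): 0.9, ("b", "c"): 0.1, ("c", "d"): 0.9, ("d", "a"): 0.1}

# Two balanced partitionings that cut the same number of edges.
part_x = {"a": 0, "b": 0, "c": 1, "d": 1}  # cuts b-c and d-a
part_y = {"a": 0, "d": 0, "b": 1, "c": 1}  # cuts a-b and c-d

print(edge_cut(etp, part_x), edge_cut(etp, part_y))
print(expected_cut_traversals(etp, part_x), expected_cut_traversals(etp, part_y))
```

Both partitionings look identical to a topology-only objective, yet under the assumed probabilities `part_y` separates the two edges the workload uses most, which is the kind of skew a workload-agnostic partitioner cannot see.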

Highlights

  • Modern big data increasingly appear in the form of large heterogeneous labeled graphs

  • We propose a simple heuristic to estimate the traversal probability of each edge by combining topological information from the graph and the query patterns, and design an efficient algorithm to calculate it

  • We propose a simple greedy algorithm to compute an exchange cluster for each partition such that the global inter-partition traversal probability decreases when these clusters are exchanged between partitions
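The greedy exchange idea in the highlights can be sketched in a much simplified form. The version below swaps single vertices rather than clusters, recomputes the objective from scratch, and uses made-up probabilities; the names `weighted_cut` and `greedy_swap` are hypothetical, and this is not the paper's algorithm, only a minimal illustration of exchanging elements between partitions while the total inter-partition traversal probability keeps decreasing.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def weighted_cut(etp: Dict[Edge, float], part: Dict[Hashable, int]) -> float:
    """Total (assumed) traversal probability carried by inter-partition edges."""
    return sum(p for (u, v), p in etp.items() if part[u] != part[v])


def greedy_swap(etp: Dict[Edge, float], part: Dict[Hashable, int],
                max_rounds: int = 10) -> Dict[Hashable, int]:
    """Repeatedly apply the single best vertex swap until nothing improves.

    Swapping two vertices from different partitions keeps partition sizes fixed.
    """
    part = dict(part)  # work on a copy
    verts = list(part)
    for _ in range(max_rounds):
        base = weighted_cut(etp, part)
        best_gain, best_pair = 1e-12, None
        for u in verts:
            for v in verts:
                if part[u] >= part[v]:
                    continue  # same partition, or pair already seen in other order
                part[u], part[v] = part[v], part[u]      # try the swap
                gain = base - weighted_cut(etp, part)
                part[u], part[v] = part[v], part[u]      # undo it
                if gain > best_gain:
                    best_gain, best_pair = gain, (u, v)
        if best_pair is None:
            break  # no swap reduces the weighted cut
        u, v = best_pair
        part[u], part[v] = part[v], part[u]
    return part


# Hypothetical 4-cycle with made-up probabilities and a poor initial partitioning.
toy_etp = {("a", "b"): 0.9, ("b", "c"): 0.1, ("c", "d"): 0.9, ("d", "a"): 0.1}
initial = {"a": 0, "d": 0, "b": 1, "c": 1}
improved = greedy_swap(toy_etp, initial)
print(weighted_cut(toy_etp, initial), weighted_cut(toy_etp, improved))
```

A cluster-level variant would move groups of vertices at once and update gains incrementally instead of recomputing the full objective for every candidate.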


Summary

INTRODUCTION

Modern big data increasingly appear in the form of large heterogeneous labeled graphs. FENNEL [7] overcomes the high computational complexity of the traditional k-balanced graph partitioning problem by relaxing the hard cardinality constraints. This method provides a unifying framework that accommodates many of the previously proposed heuristics as special cases. TAPER is the work most relevant to this paper: it improves an existing graph partitioning for a set of query patterns without using historical traces or logs. The specific contributions of this work are as follows: (1) we identify a few implementation-independent factors that influence the edge traversal probability (ETP) in pattern matching and develop a heuristic formula to estimate it using information from query patterns and graph topology.

RELATED WORKS
INDICATOR OF EDGE TRAVERSAL PROBABILITY
CALCULATING EDGE TRAVERSAL PROBABILITY
COMPUTING EXCHANGING CLUSTERS
CONCLUSION
