Scalable big graph processing in MapReduce

Lu Qin,Xuemin Lin,Chengqi Zhang,Lijun Chang,Hong Cheng,Jeffrey Xu Yu

doi:10.1145/2588555.2593661

Abstract

MapReduce has become one of the most popular parallel computing paradigms in cloud, due to its high scalability, reliability, and fault-tolerance achieved for a large variety of applications in big data processing. In the literature, there are MapReduce Class MRC and Minimal MapReduce Class MMC to define the memory consumption, communication cost, CPU cost, and number of MapReduce rounds for an algorithm to execute in MapReduce. However, neither of them is designed for big graph processing in MapReduce, since the constraints in MMC can be hardly achieved simultaneously on graphs and the conditions in MRC may induce scalability problems when processing big graph data. In this paper, we study scalable big graph processing in MapReduce. We introduce a Scalable Graph processing Class SGC by relaxing some constraints in MMC to make it suitable for scalable graph processing. We define two graph join operators in SGC, namely, EN join and NE join, using which a wide range of graph algorithms can be designed, including PageRank, breadth first search, graph keyword search, Connected Component (CC) computation, and Minimum Spanning Forest (MSF) computation. Remarkably, to the best of our knowledge, for the two fundamental graph problems CC and MSF computation, this is the first work that can achieve O(log(n)) MapReduce rounds with $O(n+m)$ total communication cost in each round and constant memory consumption on each machine, where $n$ and $m$ are the number of nodes and edges in the graph respectively. We conducted extensive performance studies using two web-scale graphs Twitter and Friendster with different graph characteristics. The experimental results demonstrate that our algorithms can achieve high scalability in big graph processing.

Full Text