An efficient iterative graph data processing framework based on bulk synchronous parallel model

Chao Liu, Xuesong Yan, Deze Zeng, Hong Yao, Linchen Yu, Zhangjie Fu

doi:10.1002/cpe.4432

Abstract

Graph data processing is widely applied in domains such as industry, science, and social networks, and has therefore attracted considerable research effort. To keep pace with the rapid growth of big graph data, Pregel‐like systems have come to be regarded as one of the most promising approaches and have drawn wide attention from researchers. However, this line of work is still at an early stage and many challenges remain. In Pregel, superstep synchronization is time consuming because iterative graph computation requires many synchronizations. Furthermore, the graph data partition strategy adopted by Pregel does not support load balancing, so network I/O overhead increases as the scale of the graph data grows. To address these issues, this paper presents an efficient computational framework for graph data processing based on the bulk synchronous parallel (BSP) model. The global synchronization control mechanism is improved by counting the number of global message files to determine when the next superstep starts. In addition, an improved graph data partition mechanism based on a balanced hash method is proposed to reduce the communication overhead between the partitions that hold sub‐graph computational tasks. We also redesign the PageRank algorithm to verify the effectiveness of the proposed framework. Experimental results on several real‐world datasets show that the proposed framework outperforms Giraph (an open source Pregel‐like system) by 58%–69% and achieves a 10×–17× performance improvement over Hadoop.
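
The ideas named in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the function names, the capacity rule in `balanced_hash_partition`, and the in-memory "outboxes" that stand in for global message files are all assumptions made for illustration. The sketch assigns vertices by hash with a fallback to the least-loaded partition, then runs PageRank in supersteps, starting the next superstep only after one outbox per partition has been counted.

```python
from collections import defaultdict


def balanced_hash_partition(vertices, num_parts):
    """Hypothetical balanced hash: use v % num_parts unless that partition
    is already at capacity, then fall back to the least-loaded partition.
    Integer vertex ids are assumed."""
    capacity = -(-len(vertices) // num_parts)  # ceiling division
    loads = [0] * num_parts
    owner = {}
    for v in vertices:
        p = v % num_parts
        if loads[p] >= capacity:
            p = min(range(num_parts), key=loads.__getitem__)
        owner[v] = p
        loads[p] += 1
    return owner, loads


def bsp_pagerank(edges, num_parts=2, damping=0.85, supersteps=30):
    """PageRank organised as BSP supersteps over the partitions above."""
    vertices = sorted({v for edge in edges for v in edge})
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)
    owner, _ = balanced_hash_partition(vertices, num_parts)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        # Compute phase: each partition writes its messages to its own outbox
        # (an in-memory stand-in for a message file).
        outboxes = [defaultdict(float) for _ in range(num_parts)]
        dangling = 0.0
        for v in vertices:
            targets = out_edges[v]
            if targets:
                share = rank[v] / len(targets)
                for dst in targets:
                    outboxes[owner[v]][dst] += share
            else:
                dangling += rank[v]  # rank of vertices without out-edges

        # Barrier: proceed only once one outbox per partition is present.
        if len(outboxes) != num_parts:
            raise RuntimeError("superstep barrier not satisfied")

        # Communication phase: merge all outboxes into per-vertex inboxes.
        incoming = defaultdict(float)
        for box in outboxes:
            for dst, value in box.items():
                incoming[dst] += value

        rank = {
            v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return rank
```

For example, `bsp_pagerank([(0, 1), (1, 2), (2, 0), (3, 0)])` returns ranks that sum to 1, with vertex 3 (which has no in-edges) receiving only the teleport share. In a distributed setting the barrier check would count files written by remote workers rather than a local list.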

Full Text