I/O efficient: computing SCCs in massive graphs

Zhiwei Zhang,Lijun Chang,Jeffrey Xu Yu,Lu Qin,Xuemin Lin

doi:10.1007/s00778-014-0372-z

Abstract

A strongly connected component ($$\mathsf {SCC}$$SCC) is a maximal subgraph of a directed graph $$G$$G in which every pair of nodes is reachable from each other in the $$\mathsf {SCC}$$SCC. With such a property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting every $$\mathsf {SCC}$$SCC of $$G$$G to a node in DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all $$\mathsf {SCC}$$SCCs in a directed graph $$G$$G is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all $$\mathsf {SCC}$$SCCs in linear time with respect to the size of a graph. However, when a graph cannot reside entirely in the main memory, the existing external or semi-external algorithms to find all $$\mathsf {SCC}$$SCCs have limitation to achieve high I/O efficiency. In this paper, we study new I/O-efficient semi-external algorithms to find all $$\mathsf {SCC}$$SCCs for a massive directed graph $$G$$G that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS-based semi-external algorithm that heavily relies on a total order, we explore a weak order based on which we investigate new algorithms. We propose a new two-phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of $$G$$G can be constructed in bounded number of sequential scans of $$G$$G. In the tree search phase, it needs to sequentially scan the graph once to find all $$\mathsf {SCC}$$SCCs. In addition, we propose a new single-phase algorithm, which combines the tree construction and tree search phases into a single phase, with three new optimization techniques. They are early acceptance, early rejection, and batch processing. By the single-phase algorithm with the new optimization techniques, we can significantly reduce the number of I/Os and the CPU cost. We prove the correctness of the algorithms. We conduct extensive experimental studies using 4 real datasets including a massive real dataset and several synthetic datasets to confirm the I/O efficiency of our approaches.

Full Text