With the proliferation of knowledge graphs, massive RDF graphs have been published on the Web. As an essential type of queries for RDF graphs, Regular Path Queries (RPQs) have been attracting increasing research efforts. However, the existing query processing approaches mainly focus on RPQs under the standard semantics, which cannot provide the provenance of the answer sets. We propose a distributed Pregel-based approach DP2RPQ to evaluating provenance-aware RPQs over big RDF graphs. Our method employs Glushkov automata to keep track of matching processes of RPQs in parallel. Meanwhile, three optimization strategies are devised according to the cost model, including vertex-computation optimization, message-communication reduction, and counting-paths alleviation, which can reduce the intermediate results of the basic DP2RPQ algorithm dramatically and overcome the counting-paths problem to some extent. The proposed algorithms are verified by extensive experiments on both synthetic and real-world datasets, which show that our approach can efficiently answer the provenance-aware RPQs over large RDF graphs. Furthermore, the RPQ semantics of DP2RPQ is richer than that of RDFPath, and the performance of DP2RPQ is still far better than that of RDFPath.
Read full abstract