Abstract

RDF graphs have been increasingly used to store and represent information shared over the Web, including social graphs and knowledge bases. With the growing scale of RDF graphs and the rising concurrency of SPARQL queries, current RDF systems struggle to process concurrent queries efficiently over massive data. The situation becomes more severe in the face of data-intensive queries (a.k.a. heavy queries), which usually lead to suboptimal response time (latency) as well as throughput collapse. In this article, we present Wukong+G, the first graph-based distributed RDF query processing system that efficiently exploits the hybrid parallelism of CPUs and GPUs. Wukong+G is made fast and concurrent by four key designs. First, Wukong+G tames the massive random memory accesses of graph exploration by efficiently mapping data between CPU and GPU memory to hide latency, using techniques such as query-aware prefetching, pattern-aware pipelining, and fine-grained swapping. Second, Wukong+G scales up by introducing a GPU-friendly RDF store that supports RDF graphs exceeding GPU memory size, using techniques such as predicate-based grouping, pairwise caching, and look-ahead replacing to narrow the gap between host and device memory capacity. Third, Wukong+G scales out through a communication layer that decouples the transfer of query metadata from that of intermediate results, and further leverages both native and GPUDirect RDMA to enable efficient communication on a CPU/GPU cluster. Finally, Wukong+G runs multiple queries simultaneously on a single GPU to improve overall throughput, and fully exploits hardware heterogeneity by adaptively scheduling a single query across the CPU and GPU. We have implemented Wukong+G by extending Wukong, a state-of-the-art distributed RDF store, with distributed GPU support.
Evaluation on a heterogeneous CPU/GPU cluster with an RDMA-capable network shows that Wukong+G outperforms Wukong by 2.3× to 9.0× and scales well to 10 GPU cards for heavy queries. Wukong+G also improves both latency and throughput by more than one order of magnitude when facing hybrid workloads.
