Analyzing Multi-trillion Edge Graphs on Large GPU Clusters: A Case Study with PageRank

Seunghwa Kang, Joseph Nke, Brad Rees

doi:10.1109/hpec55821.2022.9926341

Abstract

We previously reported PageRank performance results on a cluster with 32 A100 GPUs [7]. This paper extends that work to 2048 GPUs. The previous implementation performs well as long as the number of GPUs is small relative to the square of the average vertex degree, but its scalability deteriorates as the number of GPUs increases further. We updated the implementation with the following objectives: 1) enable analyzing a P times larger graph with P times more GPUs, up to P = 2048; 2) achieve reasonably good weak scaling; and 3) integrate the improvements into the open-source data science ecosystem (i.e., RAPIDS cuGraph, https://github.com/rapidsai/cugraph). While we evaluate the updates with PageRank in this paper, they improve the scalability of a broader set of algorithms in cuGraph. More specifically, we updated our 2D edge partitioning scheme; implemented the PDCSC (partially doubly compressed sparse column) format, a hybrid data structure that combines CSC (compressed sparse column) and DCSC (doubly compressed sparse column); adopted (key, value) pairs to store edge source vertex property values; and improved the reduction communication strategy. The 32-GPU cluster has A100 GPUs (40 GB HBM per GPU) connected with NVLink. We ran the updated implementation on the Selene supercomputer, which uses InfiniBand for inter-node communication and NVLink for intra-node communication. Each Selene node has eight A100 GPUs (80 GB HBM per GPU). Analyzing the web crawl graph (3.563 billion vertices and 128.7 billion edges, 32-bit vertex ID, unweighted, average vertex degree: 36.12) took 0.187 seconds per PageRank iteration on the 32-GPU cluster. Computing PageRank scores of a scale 38 R-mat graph (274.9 billion vertices and 4.398 trillion edges, 64-bit vertex ID, 32-bit edge weight, average vertex degree: 16) took 1.54 seconds per PageRank iteration on the Selene supercomputer with 2048 GPUs.
We conclude this paper by discussing potential network system enhancements to improve scaling.
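
For readers unfamiliar with the two storage formats that PDCSC combines, the following is a minimal plain-Python sketch of the difference between CSC and DCSC. It is an illustration under our own assumptions, not the cuGraph implementation: the helper names `build_csc` and `csc_to_dcsc` are hypothetical, and the sketch does not show how PDCSC chooses which columns use which format. It only shows why dropping empty columns saves space when most columns of a local partition have no edges.

```python
# Illustrative sketch only; build_csc and csc_to_dcsc are hypothetical helpers,
# not part of cuGraph.

def build_csc(num_cols, edges):
    """Build CSC arrays from a list of (row, col) edges.

    Returns (offsets, rows), where offsets has num_cols + 1 entries and
    rows[offsets[c]:offsets[c + 1]] holds the row indices of column c.
    """
    counts = [0] * num_cols
    for _, c in edges:
        counts[c] += 1
    offsets = [0] * (num_cols + 1)
    for c in range(num_cols):
        offsets[c + 1] = offsets[c] + counts[c]
    rows = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot per column
    for r, c in edges:
        rows[cursor[c]] = r
        cursor[c] += 1
    return offsets, rows


def csc_to_dcsc(offsets):
    """Compress the CSC offset array by dropping empty columns.

    Returns (col_ids, dcsc_offsets): the ids of the non-empty columns and one
    offset per non-empty column plus a final end offset. The row index array
    is shared with the CSC form and needs no change.
    """
    num_cols = len(offsets) - 1
    col_ids = [c for c in range(num_cols) if offsets[c + 1] > offsets[c]]
    dcsc_offsets = [offsets[c] for c in col_ids] + [offsets[-1]]
    return col_ids, dcsc_offsets


# A tiny hypersparse example: 8 columns, only 2 of which have edges.
edges = [(0, 1), (2, 1), (3, 5)]
offsets, rows = build_csc(8, edges)
col_ids, dcsc_offsets = csc_to_dcsc(offsets)
```

In this toy example the CSC offset array needs 9 entries for 8 columns, while the DCSC form needs 2 column ids and 3 offsets. CSC cost grows with the number of columns regardless of how many are empty; DCSC cost grows only with the number of non-empty columns, at the price of an extra lookup to find a column.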

Full Text