Abstract

Graphics Processing Units (GPUs) have been widely adopted to process graphs efficiently, and multi-GPU systems have recently been exploited for further performance gains. To process a graph on multiple GPUs in parallel, the input graph must be partitioned by a partitioning scheme. The choice of scheme affects the communication overhead and the locality of memory accesses, and thereby the overall performance. We observe that both intra-GPU data sharing and inter-GPU communication can be summarized as inter-thread-block (inter-TB) communication. Based on this key idea, we propose a new graph partitioning scheme that redefines the input graph as a TB Graph with computed vertex and edge weights, and then partitions it to reduce intra- and inter-GPU communication overhead and to improve locality at the granularity of thread blocks (TBs). We further propose a partitioning and mapping scheme for heterogeneous architectures whose physical links have different bandwidths. Experimental results show that our partitioning scheme improves the overall performance of Breadth-First Search (BFS) by up to 33%.
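The two-step idea in the abstract (collapse the input graph into a TB Graph, then partition that smaller graph across GPUs) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: the weight definitions (vertex count per TB as vertex weight, cross-TB edge count as edge weight), the greedy capacity-bounded partitioner, and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def build_tb_graph(edges, vertex_to_tb):
    """Collapse an input graph into a TB Graph.

    Each TB Graph node is a thread block (TB); its vertex weight is
    the number of input vertices mapped to it, and each edge weight
    counts the input edges crossing between two TBs -- a simple proxy
    for inter-TB communication volume (an assumed weighting).
    """
    tb_vertex_weight = defaultdict(int)       # tb -> vertex count
    tb_edges = defaultdict(int)               # (tb_u, tb_v) -> weight
    for _, tb in vertex_to_tb.items():
        tb_vertex_weight[tb] += 1
    for u, v in edges:
        tu, tv = vertex_to_tb[u], vertex_to_tb[v]
        if tu != tv:
            tb_edges[(min(tu, tv), max(tu, tv))] += 1
    return dict(tb_vertex_weight), dict(tb_edges)

def greedy_partition(tb_vertex_weight, tb_edges, num_gpus):
    """Greedily assign TBs to GPUs.

    Prefers the GPU holding the TBs this TB communicates with most
    (to cut inter-GPU traffic), subject to a simple capacity bound
    that keeps vertex weight roughly balanced across GPUs.
    """
    cap = math.ceil(sum(tb_vertex_weight.values()) / num_gpus)
    assignment, load = {}, [0] * num_gpus
    for tb in sorted(tb_vertex_weight, key=tb_vertex_weight.get, reverse=True):
        # Affinity of this TB to each GPU = weight of edges to TBs
        # already placed there.
        affinity = [0] * num_gpus
        for (a, b), w in tb_edges.items():
            other = b if a == tb else a if b == tb else None
            if other is not None and other in assignment:
                affinity[assignment[other]] += w
        eligible = [g for g in range(num_gpus)
                    if load[g] + tb_vertex_weight[tb] <= cap]
        if not eligible:                      # relax the bound if full
            eligible = list(range(num_gpus))
        gpu = max(eligible, key=lambda g: (affinity[g], -load[g]))
        assignment[tb] = gpu
        load[gpu] += tb_vertex_weight[tb]
    return assignment
```

For example, if TBs 0 and 1 exchange most of their edges with each other, and TBs 2 and 3 likewise, the partitioner places each heavily communicating pair on the same GPU, so only the light cross-pair edges become inter-GPU traffic. A production scheme would instead feed the weighted TB Graph to a multi-constraint partitioner.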
