Abstract

Limited by memory capacity and compute power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, given the explosive growth in graph sizes. Large-scale GCNs therefore call for a multi-node acceleration system (MultiAccSys), analogous to TPU-Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication patterns and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) coarse-grained communication patterns exist in the execution of GCNs in a MultiAccSys, introducing a massive amount of redundant network transmissions and off-chip memory accesses; and (2) overall, the acceleration of GCNs in a MultiAccSys is bandwidth-bound and latency-tolerant. Guided by these two observations, we propose MultiGCN, the first MultiAccSys for large-scale GCNs, which trades network latency for network bandwidth. Specifically, leveraging this tolerance to network latency, we first propose a topology-aware multicast mechanism with a one-put-per-multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only 28%~68% of the energy, while reducing network transmissions by 32% and off-chip memory accesses by 73% on average. It not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, unlike single-node GCN accelerators.
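To make the one-put-per-multicast idea concrete, the following minimal Python sketch (our illustration, not the paper's implementation) contrasts link traversals for unicast versus multicast delivery of a vertex feature to several remote nodes. The ring topology, node count, and function names here are assumptions chosen for simplicity; the paper's mechanism is topology-aware and may route along a different interconnect.

```python
def unicast_transmissions(src, dests, num_nodes):
    """Baseline: one point-to-point put per destination.
    Counts hop-by-hop link traversals on a unidirectional ring,
    so shared path segments are paid for repeatedly."""
    return sum((d - src) % num_nodes for d in dests)

def multicast_transmissions(src, dests, num_nodes):
    """One put per multicast: the message travels the ring once and
    each destination copies it as it passes, so every link between
    src and the farthest destination is traversed exactly once."""
    if not dests:
        return 0
    return max((d - src) % num_nodes for d in dests)

if __name__ == "__main__":
    num_nodes, src = 8, 0
    dests = [2, 3, 5, 7]  # remote nodes that need this vertex's feature
    print("unicast puts :", unicast_transmissions(src, dests, num_nodes))   # 17 link traversals
    print("one multicast:", multicast_transmissions(src, dests, num_nodes)) # 7 link traversals
```

Even in this toy setting, a single multicast put cuts link traversals from 17 to 7; on real graphs, where high-degree vertices are needed by many partitions, this kind of sharing is what reduces the network bandwidth pressure the abstract describes.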
