Graph Neural Network (GNN) models have attracted attention, given their high accuracy in interpreting graph data. One of the primary building blocks of a GNN model is aggregation, which gathers and averages the feature vectors of the nodes adjacent to each node. Aggregation works by multiplying the adjacency matrix by the feature matrix. For many realistic datasets, both matrices exceed the on-chip cache capacity, and the adjacency matrix is highly sparse. These characteristics lead to little data reuse, causing intensive main-memory accesses during aggregation. Thus, aggregation is memory-intensive and accounts for most of the total execution time. In this paper, we propose GraNDe, a near-data processing (NDP) architecture that accelerates memory-intensive aggregation operations by locating NDP modules near the DRAM datapath to exploit rank-level parallelism. GraNDe maximizes bandwidth utilization by separating the memory channel path with the buffer chip in between, so that pre-/post-processing in the host processor and reduction in the NDP modules operate simultaneously. By exploring the preferred data mappings of the operand matrices to DRAM ranks, we architect GraNDe to support adaptive matrix mapping, which applies the optimal mapping for each layer depending on the dimension of the layer and the configuration of the memory system. We also propose adj-bundle broadcasting and re-tiling optimizations, which reduce the transfer time for adjacency matrix data and improve feature vector data reusability by tiling with consideration of the adjacency between nodes.
GraNDe achieves 3.01× and 1.69× on average, and up to 4.00× and 1.98× speedups of GCN aggregation over the baseline system and the state-of-the-art NDP architecture for GCN, respectively.
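To make the aggregation step concrete, the following is a minimal software sketch of mean aggregation as a sparse adjacency matrix multiplied by a dense feature matrix. It is an illustration only, not GraNDe's hardware dataflow: the function name `aggregate_mean`, the compressed sparse row (CSR) layout, and the toy graph are assumptions introduced here, and plain averaging stands in for whatever normalization a particular GCN layer uses.

```python
from typing import List, Sequence


def aggregate_mean(
    row_ptr: Sequence[int],
    col_idx: Sequence[int],
    features: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Average the feature vectors of each node's neighbors.

    The adjacency matrix is given in CSR form (hypothetical layout chosen
    for this sketch): the neighbors of node v are
    col_idx[row_ptr[v]:row_ptr[v + 1]].
    """
    num_nodes = len(row_ptr) - 1
    dim = len(features[0]) if features else 0
    out = [[0.0] * dim for _ in range(num_nodes)]

    for v in range(num_nodes):
        start, end = row_ptr[v], row_ptr[v + 1]
        degree = end - start
        if degree == 0:
            continue  # isolated node: leave its output as zeros

        # Gather: each neighbor id selects a row of the feature matrix.
        # Neighbor ids are scattered, so consecutive reads rarely touch
        # the same rows, which is the source of the low data reuse.
        for k in range(start, end):
            u = col_idx[k]
            for d in range(dim):
                out[v][d] += features[u][d]

        # Reduce: divide the accumulated sum by the neighbor count.
        for d in range(dim):
            out[v][d] /= degree

    return out


# Toy undirected graph with edges 0-1, 0-2, 1-2, 2-3 (assumed example data).
ROW_PTR = [0, 2, 4, 7, 8]
COL_IDX = [1, 2, 0, 2, 0, 1, 3, 2]
FEATURES = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

AGGREGATED = aggregate_mean(ROW_PTR, COL_IDX, FEATURES)
```

In this sketch every output row needs one feature-matrix row per neighbor, so the number of feature reads grows with the number of edges while the arithmetic per read stays small. That ratio is why the operation tends to be bound by memory bandwidth rather than by compute.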