CODA

Hyojong Kim, Lifeng Nai, Ramyad Hadidi, Gabriel Loh, Yasuko Eckert, Hyesoon Kim, Nuwan Jayasena, Onur Kayiran

doi:10.1145/3232521

Abstract

To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), namely memory interleaving and thread block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when the data and compute are misaligned. Nondeterministic thread block scheduling, used to improve compute resource utilization, impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and places the data along with the thread block that accesses it in the same GPU. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and that exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) using an affinity-based thread block scheduling policy, we can co-place compute and data; and (3) by using a dual address mode with lightweight changes to virtual-to-physical page mappings, we can selectively choose different interleaved memory pages for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
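The interaction between interleaving granularity and thread block placement can be illustrated with a toy model. The sketch below is not the paper's mechanism; the GPU count, the 256-byte fine-grained unit, the 4 KB page size, and all function names are assumptions chosen for illustration. It assigns each thread block to the GPU that holds the start of its exclusive data under page-granularity interleaving, then counts how many of that block's accesses would land on a different GPU under each interleaving scheme.

```python
# Toy model: all constants and names here are illustrative assumptions.
NUM_GPUS = 4        # assumed number of GPUs
FINE_GRAIN = 256    # assumed fine-grained interleaving unit, in bytes
PAGE_SIZE = 4096    # assumed coarse-grained interleaving unit (one page)


def home_gpu_fine(addr):
    """GPU owning addr when memory is interleaved every FINE_GRAIN bytes."""
    return (addr // FINE_GRAIN) % NUM_GPUS


def home_gpu_coarse(addr):
    """GPU owning addr when memory is interleaved every PAGE_SIZE bytes."""
    return (addr // PAGE_SIZE) % NUM_GPUS


def affinity_gpu(block_id, bytes_per_block):
    """Schedule a thread block on the GPU holding the first byte of its
    exclusive data under coarse-grained interleaving."""
    return home_gpu_coarse(block_id * bytes_per_block)


def remote_fraction(home_gpu, bytes_per_block, num_blocks, stride=64):
    """Fraction of accesses served by a GPU other than the one running
    the thread block, sampling one access every `stride` bytes."""
    remote = total = 0
    for block_id in range(num_blocks):
        gpu = affinity_gpu(block_id, bytes_per_block)
        base = block_id * bytes_per_block
        for offset in range(0, bytes_per_block, stride):
            total += 1
            if home_gpu(base + offset) != gpu:
                remote += 1
    return remote / total


if __name__ == "__main__":
    fine = remote_fraction(home_gpu_fine, PAGE_SIZE, num_blocks=8)
    coarse = remote_fraction(home_gpu_coarse, PAGE_SIZE, num_blocks=8)
    print(f"fine-grained remote fraction:   {fine:.2f}")
    print(f"coarse-grained remote fraction: {coarse:.2f}")
```

In this model, when each thread block exclusively touches exactly one page, coarse-grained interleaving combined with affinity-based placement keeps every access local, while fine-grained interleaving spreads the same page over all four GPUs. Data shared among thread blocks would not benefit in the same way, which is consistent with choosing the interleaving mode per data structure.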

Full Text