Abstract

With generational gains from transistor scaling, GPUs have been able to accelerate traditionally computation-intensive workloads. But with the end of Moore's Law, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly-coupled multi-GPU systems. However, multi-GPU systems are prevented from efficiently utilizing their compute resources by the Non-Uniform Memory Access (NUMA) bottleneck. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations tailored to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that the GPU's conventional cache subsystem does not exploit well. For cache-insensitive workloads, DualOpt transfers remote data at fine granularity instead of performing conventional cache-line transfers; these fine-grained requests are then coalesced to utilize inter-GPU bandwidth efficiently. For cache-friendly workloads, DualOpt adds a remote-only cache that exploits the locality of remote accesses. Finally, a decision engine automatically identifies a workload's class and applies the corresponding optimization, improving overall performance by 2.5× on a 4-GPU system with a small hardware overhead of 0.032%.
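To make the classification idea concrete, the following is a minimal software sketch, not the paper's hardware mechanism: it estimates spatio-temporal locality as the fraction of remote accesses that re-touch a recently seen cache line within a sliding window, then selects an optimization. The metric, window size, threshold, and all names (remoteReuseFraction, classify) are illustrative assumptions rather than details from the paper.

```cpp
// Minimal sketch (illustrative, not the paper's design) of a decision
// engine that classifies a workload by the spatio-temporal locality of
// its remote accesses and picks the matching optimization.
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

enum class WorkloadClass { CacheInsensitive, CacheFriendly };

// Hypothetical locality metric: fraction of remote cache-line accesses
// that re-touch a line last seen within a sliding window of accesses.
double remoteReuseFraction(const std::vector<std::size_t>& remoteLineAddrs,
                           std::size_t window) {
    std::unordered_map<std::size_t, std::size_t> lastSeen; // line -> index
    std::size_t reuses = 0;
    for (std::size_t i = 0; i < remoteLineAddrs.size(); ++i) {
        auto it = lastSeen.find(remoteLineAddrs[i]);
        if (it != lastSeen.end() && i - it->second <= window)
            ++reuses;
        lastSeen[remoteLineAddrs[i]] = i;
    }
    return remoteLineAddrs.empty()
               ? 0.0
               : static_cast<double>(reuses) / remoteLineAddrs.size();
}

// Assumed threshold; the paper's hardware engine uses its own criterion.
WorkloadClass classify(double reuseFraction, double threshold = 0.3) {
    return reuseFraction >= threshold ? WorkloadClass::CacheFriendly
                                      : WorkloadClass::CacheInsensitive;
}

int main() {
    // Toy remote-access trace of cache-line addresses.
    std::vector<std::size_t> trace = {1, 2, 3, 1, 2, 3, 4, 5, 1};
    double f = remoteReuseFraction(trace, /*window=*/8);
    WorkloadClass c = classify(f);
    std::printf("reuse=%.2f -> %s\n", f,
                c == WorkloadClass::CacheFriendly
                    ? "cache-friendly: enable remote-only cache"
                    : "cache-insensitive: fine-granularity coalesced transfers");
    return 0;
}
```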
