Abstract

Recent work has shown that building GPUs with hundreds of SMs on a single monolithic chip is impractical due to slowing growth in transistor density, low chip yields, and photoreticle limitations. To maintain performance scalability, proposals exist to aggregate discrete GPUs into a larger virtual GPU and to decompose a single GPU into multi-chip modules (MCMs) with increased aggregate die area. These approaches introduce non-uniform memory access (NUMA) effects that degrade performance and energy efficiency if not managed appropriately. To overcome these effects, we propose a holistic Locality-Aware Data Management (LADM) system designed to operate on massive logical GPUs composed of multiple discrete devices, which are themselves composed of chiplets. LADM has three key components: a threadblock-centric index analysis, a runtime system that performs data placement and threadblock scheduling, and an adaptive cache insertion policy. The runtime combines information from the static analysis with topology information to proactively optimize data placement, threadblock scheduling, and remote data caching, minimizing off-chip traffic. Compared to state-of-the-art multi-GPU scheduling, LADM reduces inter-chip memory traffic by 4× and improves system performance by 1.8× on a future multi-GPU system.
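To make the scheduling half of this idea concrete, the C++ sketch below assigns each threadblock to the chiplet that owns the data its dominant access stream touches first, assuming an affine per-block index (index = scale · blockIdx + bias) and an even partition of the array across chiplets. The affine-index model, the even partition, and all names here (AffineIndex, schedule_block, etc.) are illustrative assumptions, not LADM's actual interface.

```cpp
// Minimal sketch of locality-aware threadblock-to-chiplet assignment:
// place threadblocks on the chiplet that owns the data they touch,
// so their dominant accesses stay local and inter-chiplet traffic drops.
#include <cstdio>

struct AffineIndex {      // per-block element index = scale * blockIdx + bias
    long long scale;
    long long bias;
};

// Data is placed so that chiplet c owns bytes [c*chunk, (c+1)*chunk).
int owner_chiplet(long long byte_offset, long long chunk_bytes, int num_chiplets) {
    int c = static_cast<int>(byte_offset / chunk_bytes);
    return c < num_chiplets ? c : num_chiplets - 1;
}

// Schedule each threadblock on the chiplet owning the first element of
// its dominant access stream, as derived from the static index analysis.
int schedule_block(int block_id, AffineIndex idx, long long elem_bytes,
                   long long chunk_bytes, int num_chiplets) {
    long long first_byte = (idx.scale * block_id + idx.bias) * elem_bytes;
    return owner_chiplet(first_byte, chunk_bytes, num_chiplets);
}

int main() {
    const int num_blocks = 8, num_chiplets = 4;
    const long long elem_bytes = 4;                    // float elements
    const long long total_bytes = 1024 * elem_bytes;   // whole array
    const long long chunk = total_bytes / num_chiplets;
    AffineIndex idx = {128, 0};                        // e.g. A[128*blockIdx.x + tid]
    for (int b = 0; b < num_blocks; ++b)
        printf("block %d -> chiplet %d\n", b,
               schedule_block(b, idx, elem_bytes, chunk, num_chiplets));
    return 0;
}
```

With these parameters each chiplet receives the two consecutive threadblocks whose index ranges fall in its memory chunk, so the dominant stream never crosses a chiplet boundary; the real system additionally weighs topology and caches the remaining remote accesses.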
