Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes

Javier Cabezas, Isaac Gelado, Nacho Navarro, Thomas B Jablin, Wen-Mei W Hwu, Lluís Vilanova

doi:10.1145/2751205.2751218

Abstract

In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability of modern GPUs to ensure that data can be accessed regardless of its physical location, which allows the runtime to safely decompose arrays and distribute them across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can select more efficient computation and data distribution configurations than previous work. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that spreads remote accesses across the whole kernel execution further reduces their overhead. Results show execution speedups of 1.98× on 2 GPUs and 3.89× on 4 GPUs for a wide range of dense computations, compared to the original versions on a single GPU.
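The relationship between data distribution and remote accesses can be illustrated with a toy model. The sketch below is not AMGE's implementation. It assumes a 1D three-point stencil, a contiguous block distribution of the array across GPUs, and that each thread runs on the GPU that holds its output element. The function names `owner` and `remote_fraction` are hypothetical and chosen only for this illustration.

```python
def owner(index, n_elems, n_gpus):
    """GPU holding element `index` under a contiguous block distribution (assumed layout)."""
    chunk = -(-n_elems // n_gpus)  # ceiling division: elements per GPU
    return index // chunk


def remote_fraction(n_elems, n_gpus, halo=1):
    """Fraction of stencil reads that fall in another GPU's memory."""
    total = 0
    remote = 0
    for i in range(n_elems):
        # Assumption: thread i executes on the GPU that owns element i.
        gpu = owner(i, n_elems, n_gpus)
        for j in range(i - halo, i + halo + 1):
            if 0 <= j < n_elems:
                total += 1
                if owner(j, n_elems, n_gpus) != gpu:
                    remote += 1
    return remote / total


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        print(gpus, remote_fraction(1024, gpus))
```

In this model only the reads that cross a partition boundary are remote, so with 1024 elements on 4 GPUs the remote fraction is 6 out of 3070 reads, well under the 5% figure quoted in the abstract. Access patterns with wider or irregular reach would raise that fraction, which is the kind of information the compiler analysis described above is meant to expose.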

Similar Papers

Automatic execution of single-GPU computations across multiple GPUs
Javier Cabezas ... Lluís Vilanova
24 Aug 2014

Multi-GPU System Design with Memory Networks
Gwangsun Kim ... Jiyun Jeong
01 Dec 2014

Efficient automatic parallelization of a single GPU program for a multiple GPU system
Matam Kiran Kumar ... Murali Annavaram
Integration | VOL. 66
07 Jan 2019

Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU
Wei Han ... Matthew Buland
01 Sep 2017
