Abstract

Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize the shared hardware resources, such as memory and network, because of their high thread-level parallelism (TLP), and discuss the limitations of existing GPU-based concurrency management techniques when employed in heterogeneous systems. To solve this problem, we propose an integrated concurrency management strategy that modulates the TLP in GPUs to control the performance of both CPU and GPU applications. This mechanism considers both GPU core state and system-wide memory and network congestion information to dynamically decide on the level of GPU concurrency to maximize system performance. We propose and evaluate two schemes: one (CM-CPU) for boosting CPU performance in the presence of GPU interference, the other (CM-BAL) for improving both CPU and GPU performance in a balanced manner and thus overall system performance. Our evaluations show that the first scheme improves average CPU performance by 24%, while reducing average GPU performance by 11%. The second scheme provides 7% average performance improvement for both CPU and GPU applications. We also show that our solution allows the user to control performance trade-offs between CPUs and GPUs.
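To make the mechanism described above concrete, the sketch below illustrates one way such an epoch-based TLP controller could be organized: each epoch it reads system-wide congestion and GPU stall statistics and raises or lowers the number of warps each GPU core may schedule. This is a minimal illustration only; the structure, metric names (memLatency, netCongestion, gpuStallRatio), thresholds, and step sizes are assumptions, not the paper's actual CM-CPU/CM-BAL parameters, which are specified in the paper body.

    // Hypothetical epoch-based GPU concurrency (TLP) controller, in the spirit
    // of CM-CPU/CM-BAL. All metric names and constants below are illustrative
    // assumptions, not the mechanisms proposed in the paper.
    #include <algorithm>

    struct SystemStats {
        double memLatency;     // average memory access latency this epoch (cycles)
        double netCongestion;  // fraction of cycles NoC injection was stalled
        double gpuStallRatio;  // fraction of cycles GPU cores could not issue
    };

    class ConcurrencyManager {
    public:
        explicit ConcurrencyManager(int maxWarps)
            : maxWarps_(maxWarps), activeWarps_(maxWarps) {}

        // Called once per sampling epoch with system-wide statistics.
        // Returns the number of warps each GPU core may schedule.
        int update(const SystemStats& s, bool balanceGpu /* CM-BAL-like mode */) {
            if (s.memLatency > kHighLatency || s.netCongestion > kHighCongestion) {
                // Shared resources are congested: throttle GPU TLP so that
                // latency-sensitive CPU requests see less interference.
                activeWarps_ = std::max(kMinWarps, activeWarps_ - kStep);
            } else if (balanceGpu && s.gpuStallRatio > kHighStall) {
                // Balanced mode: give concurrency back to the GPU when its
                // cores are starved and the shared resources have slack.
                activeWarps_ = std::min(maxWarps_, activeWarps_ + kStep);
            }
            return activeWarps_;
        }

    private:
        static constexpr int kMinWarps = 4;            // assumed lower bound
        static constexpr int kStep = 2;                // assumed adjustment step
        static constexpr double kHighLatency = 400.0;  // assumed threshold (cycles)
        static constexpr double kHighCongestion = 0.5; // assumed threshold
        static constexpr double kHighStall = 0.6;      // assumed threshold

        int maxWarps_;
        int activeWarps_;
    };

In a CM-CPU-like configuration the controller only throttles GPU concurrency to protect CPU performance; the balanceGpu flag mimics CM-BAL's willingness to hand concurrency back to the GPU when its cores are starved.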

Highlights

  • GPUs have made headway as the computational workhorses for many throughput-oriented applications, outperforming general-purpose CPUs [32]

  • We show that existing GPU concurrency management solutions [24, 43] are suboptimal for maximizing overall system performance in a heterogeneous CPU-GPU system due to the large differences in latency/bandwidth requirements of CPU and GPU applications

  • Before discussing the necessity of thread-level parallelism (TLP) management in a heterogeneous platform and proposing our solution, we describe our baseline architecture consisting of cores, a network-on-chip (NoC), and memory controllers (MCs)

Summary

INTRODUCTION

GPUs have made headway as the computational workhorses for many throughput-oriented applications, outperforming general-purpose CPUs [32]. Existing GPU concurrency management solutions either target performance improvements only for cache-sensitive applications (CCWS [43]) or base their decisions on the latency tolerance of GPU cores (DYNCTA [24]). These mechanisms are oblivious to the CPU cores and do not take into account system-wide information such as memory and network congestion. While one may argue that the alternative approach of partitioning network and memory resources between the CPU and the GPU might solve the contention problem, we show that such resource isolation leads to severe underutilization of resources, which in turn significantly hurts CPU performance, GPU performance, or both (Section 2.2). For this reason, we use a system with shared resources as our baseline in most of our evaluations, and show (in Section 6.4) that our proposed schemes also work effectively in a system with partitioned resources. This is the first work that introduces GPU concurrency management mechanisms designed to improve both CPU and GPU performance in heterogeneous systems.

Baseline Configuration
Network and Memory Controller Configuration
Limitations of Existing Techniques
Effects of GPU Concurrency on GPU Performance
Effects of GPU Concurrency on CPU Performance
MANAGING GPU CONCURRENCY
Part 1 of CM-BAL
EXPERIMENTAL METHODOLOGY
Dynamism of Concurrency
Application Performance Results
Comparison to Static Warp Limiting
Sensitivity Experiments
Other Analyses and Discussion
RELATED WORK
CONCLUSIONS