Co-Scheduling on Fused CPU-GPU Architectures With Shared Last Level Caches

Marvin Damschen, Frank Mueller, Jorg Henkel

doi:10.1109/tcad.2018.2857042

Abstract

Fused CPU-GPU architectures integrate a CPU and a general-purpose GPU on a single die. Recent fused architectures even share the last level cache (LLC) between CPU and GPU. This enables hardware-supported byte-level coherency. Thus, CPU and GPU can execute computational kernels collaboratively, but novel methods to co-schedule work are required. This paper contributes three dynamic co-scheduling methods. Two of our methods implement workers that autonomously acquire work from a common set of independent work items (similar to bag-of-tasks scheduling). The third method, host-side profiling, uses a fraction of the total work of a kernel to determine, based on profiling, the ratio in which work is distributed to CPU and GPU. The resulting ratio is used for the following executions of the same kernel. Our methods are realized using OpenCL 2.0, which introduces fine-grained shared virtual memory (SVM) to allocate coherent memory between CPU and GPU. We port the Rodinia Benchmark Suite, a standard suite for heterogeneous computing, to fine-grained SVM and fused CPU-GPU architectures (Rodinia-SVM). We evaluate the overhead of fine-grained SVM and analyze the suitability of OpenCL 2.0’s new features for co-scheduling. Our host-side profiling method performs competitively with the optimal choice of executing kernels either on CPU or GPU (hypothetical xor-Oracle). On average, it achieves 97% of xor-Oracle’s performance and a $1.43\times$ speedup over using the GPU alone (standard in Rodinia). We show, however, that in most cases it is not beneficial to split the work of a kernel between CPU and GPU compared to exclusively running it on the most suitable single compute device. For a fixed amount of work per device, cache-related stalls can increase by up to $1.75\times$ when both devices are used in parallel instead of exclusively, while cache misses remain the same. Thus, not the cost of cache conflicts, but inefficient cache coherence is a major performance bottleneck for current fused CPU-GPU Intel architectures with shared LLC.
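The host-side profiling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the ratio is derived from the profiling run, so the sketch assumes a split proportional to measured throughput; the function names `split_ratio` and `partition` and the sample timings are hypothetical and not taken from the paper.

```python
def split_ratio(cpu_items, cpu_seconds, gpu_items, gpu_seconds):
    """Return the fraction of work items to assign to the CPU.

    Assumption (not from the paper): each device receives work in
    proportion to the throughput it showed on the profiled fraction.
    """
    cpu_rate = cpu_items / cpu_seconds  # work items per second on the CPU
    gpu_rate = gpu_items / gpu_seconds  # work items per second on the GPU
    return cpu_rate / (cpu_rate + gpu_rate)


def partition(total_items, cpu_fraction):
    """Split total_items into (cpu_count, gpu_count) using the ratio."""
    cpu_count = round(total_items * cpu_fraction)
    return cpu_count, total_items - cpu_count


# Hypothetical profiling measurements: each device processed 1000 items.
cpu_fraction = split_ratio(cpu_items=1000, cpu_seconds=0.4,
                           gpu_items=1000, gpu_seconds=0.1)

# Reuse the ratio for a later execution of the same kernel.
cpu_count, gpu_count = partition(100_000, cpu_fraction)
print(cpu_fraction, cpu_count, gpu_count)
```

In this sketch the ratio is computed once and reused, mirroring the abstract's statement that the resulting ratio is applied to the following executions of the same kernel.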


Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Publication Date: Nov 1, 2018
Citations: 16
License type: publisher-specific, author manuscript


Similar Papers

Enable back memory and global synchronization on LLC buffer
Licheng Yu ... Xueqing Lou
The Journal of Supercomputing | VOL. 73
15 Jun 2017

Efficient Cache Resizing policy for DRAM-based LLCs in ChipMultiprocessors
Bindu Agarwalla ... Nilkanta Sahu
Journal of Systems Architecture | VOL. 113
17 Sep 2020

Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM Based Last Level Cache
Kunal Korgaonkar ... Ian Young
01 Jun 2018

Process variation aware DRAM-Cache resizing
Bindu Agarwalla ... Shirshendu Das
Journal of Systems Architecture | VOL. 123
01 Feb 2022
