Device Hopping

Paul Metzger, Christian Fensch, Volker Seeker, Murray Cole

doi:10.1145/3471909

Abstract

Existing OS techniques for homogeneous many-core systems make it simple for single-threaded and multithreaded applications to migrate between cores. Heterogeneous systems do not benefit as fully from this flexibility, and applications that cannot migrate in mid-execution may lose potential performance. The situation is particularly challenging when a switch of language runtime would be desirable in conjunction with a migration. We present a case study in making heterogeneous CPU + GPU systems more flexible in this respect. Our technique for fine-grained application migration allows switches between OpenMP, OpenCL, and CUDA execution, in conjunction with migrations from GPU to CPU and from CPU to GPU. To achieve this, we subdivide iteration spaces into slices and consider migration on a slice-by-slice basis. We show that slice sizes can be learned offline by machine learning models. To further improve performance, memory transfers are made migration-aware. The complexity of the migration capability is hidden from programmers behind a high-level programming model. We present a detailed evaluation of our mid-kernel migration mechanism with the First Come, First Served scheduling policy. In a focused evaluation scenario, we compare our technique against idealized kernel-by-kernel scheduling, which is typical of current systems: it makes perfect kernel-to-device scheduling decisions but cannot migrate kernels mid-execution. Models show that adding fine-grained migration can achieve a speedup of up to 1.33× over such systems. In experiments with all nine applicable SHOC and Rodinia benchmarks, we achieve speedups of up to 1.30× (1.08× on average) over an implementation of a perfect scheduler that cannot migrate kernels, when kernels are migrated to a faster device. Our mechanism and slice size choices introduce an average slowdown of only 2.44% if kernels never migrate. Lastly, our programming model reduces code size by at least 88% compared to manual implementations of migratable kernels.
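
The slice-by-slice idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `run_sliced`, `cpu_backend`, `gpu_backend`, and `policy` are hypothetical, both "back ends" are plain Python stand-ins for OpenMP and CUDA/OpenCL execution, and the slice size is fixed by hand rather than learned by a model. The sketch only shows the control structure: the iteration space is cut into slices, and the target device is reconsidered at every slice boundary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical back ends: each processes iterations [start, stop) of the same kernel.
# In a real system these would be OpenMP and CUDA/OpenCL versions of the kernel.
Backend = Callable[[List[float], List[float], int, int], None]


def cpu_backend(src: List[float], dst: List[float], start: int, stop: int) -> None:
    for i in range(start, stop):
        dst[i] = 2.0 * src[i]


def gpu_backend(src: List[float], dst: List[float], start: int, stop: int) -> None:
    # Stand-in for a device launch; computes the same result as the CPU version.
    dst[start:stop] = [2.0 * x for x in src[start:stop]]


def run_sliced(
    src: List[float],
    backends: Dict[str, Backend],
    slice_size: int,
    initial_device: str,
    policy: Callable[[int, str], str],
) -> Tuple[List[float], List[Tuple[str, int, int]]]:
    """Run the kernel slice by slice; `policy` may pick a new device before each slice."""
    dst = [0.0] * len(src)
    trace = []  # records (device, start, stop) for every executed slice
    device = initial_device
    for start in range(0, len(src), slice_size):
        stop = min(start + slice_size, len(src))
        device = policy(start, device)  # migration is only considered at slice boundaries
        backends[device](src, dst, start, stop)
        trace.append((device, start, stop))
    return dst, trace


# Example policy (an assumption): a faster device becomes free once iteration 4 is reached.
def policy(next_start: int, current: str) -> str:
    return "gpu" if next_start >= 4 else current


data = [float(i) for i in range(10)]
result, trace = run_sliced(
    data, {"cpu": cpu_backend, "gpu": gpu_backend}, slice_size=4,
    initial_device="cpu", policy=policy,
)
```

In this sketch a smaller `slice_size` lets the kernel react to a newly available device sooner but adds more per-slice overhead, which is the trade-off the abstract addresses by learning slice sizes offline.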

More From: ACM Transactions on Architecture and Code Optimization
