Warped-slicer

Qiumin Xu, Keunsoo Kim, Hyeran Jeon, Won Woo Ro, Murali Annavaram

doi:10.1145/3007787.3001161

Abstract

As technology scales, GPUs are forecast to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is growing hardware support for multiprogramming on GPUs. Hyper-Q has been introduced in the Kepler architecture, which enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that there is not one intra-SM slicing strategy that derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly to enable dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
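To make the idea of choosing a partition from profiled scaling curves concrete, here is a minimal sketch. It is not the paper's analytical model: it uses a plain exhaustive search, ignores the inter-kernel interference the abstract mentions, and assumes a single shared resource budget. The function name `best_partition`, the profile numbers, the per-block costs, and the normalized-throughput objective are all hypothetical and chosen only for illustration.

```python
from itertools import product


def best_partition(profiles, costs, budget):
    """Pick how many thread blocks each kernel gets on one SM.

    profiles[k][n]: hypothetical profiled throughput of kernel k when it
                    runs with n thread blocks (index 0 means no blocks).
    costs[k]:       hypothetical resource units one block of kernel k uses.
    budget:         hypothetical total resource units available on the SM.

    Returns (counts, score), or (None, -inf) if nothing fits.
    """
    best, best_score = None, float("-inf")
    # Every kernel gets at least one block; try all combinations.
    ranges = [range(1, len(p)) for p in profiles]
    for counts in product(*ranges):
        used = sum(c * n for c, n in zip(costs, counts))
        if used > budget:
            continue  # this split does not fit on the SM
        # Assumed objective: sum of throughput normalized to each
        # kernel's best profiled throughput.
        score = sum(p[n] / max(p) for p, n in zip(profiles, counts))
        if score > best_score:
            best, best_score = counts, score
    return best, best_score


# Made-up curves: kernel A scales linearly with thread blocks,
# kernel B saturates early and then degrades.
kernel_a = [0, 1, 2, 3, 4, 5, 6, 7, 8]
kernel_b = [0, 3, 5, 6, 6, 6, 5, 5, 4]
counts, score = best_partition([kernel_a, kernel_b], costs=[1, 1], budget=8)
print(counts, score)
```

With these made-up curves the search gives the saturating kernel only as many blocks as it needs to reach its peak and hands the rest to the kernel that keeps scaling, which is the intuition behind slicing an SM unevenly rather than splitting it in half.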

Journal: ACM SIGARCH Computer Architecture News
Publication Date: Jun 18, 2016
Citations: 97

Similar Papers

Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming
Qiumin Xu ... Murali Annavaram
01 Jun 2016

Fair and cache blocking aware warp scheduling for concurrent kernel execution on GPU
Chen Zhao ... Huiyang Zhou
Future Generation Computer Systems | VOL. 112
21 May 2020

Process variation-aware workload partitioning algorithms for GPUs supporting spatial-multitasking
24 Mar 2014

Dynamic Resource Management for Efficient Utilization of Multitasking GPUs
Jason Jong Kyu Park ... Scott Mahlke
ACM SIGPLAN Notices | VOL. 52
04 Apr 2017
