Abstract
Tuning the kernel work-group size for GPUs is a challenging problem. In this paper, using the performance counters provided by GPUs, we characterize a large body of OpenCL kernels to identify the performance factors that affect the choice of a good work-group size. Based on this characterization, we find that the most influential performance factors with respect to the work-group size are occupancy, coalesced global memory accesses, cache contention, and variation in the amount of workload in the kernel. By addressing these performance factors one by one, we propose auto-tuning techniques that select the best work-group size and shape for GPU kernels. We demonstrate the effectiveness of our auto-tuner by evaluating it on a set of 54 OpenCL kernels running on three different NVIDIA GPUs and one AMD GPU. On average, the auto-tuner needs no more than 8 percent of the time required by an exhaustive search to find an optimal work-group size, and the execution time with the selected (possibly sub-optimal) work-group size is, on average, at most 1.14x slower than that with the optimal work-group size found by the exhaustive search.
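For context, the sketch below illustrates the exhaustive-search baseline the abstract refers to, not the paper's auto-tuner: it enumerates candidate 1-D work-group (local) sizes and times each kernel launch with OpenCL event profiling, keeping the fastest. The kernel, a command queue created with CL_QUEUE_PROFILING_ENABLE, the global size, and the candidate list are assumed and illustrative.

```c
/* Hedged sketch of an exhaustive search over 1-D work-group sizes.
 * Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE and that
 * `kernel` has its arguments already set; candidate sizes are illustrative. */
#include <CL/cl.h>

size_t pick_best_local_size(cl_command_queue queue, cl_kernel kernel,
                            size_t global_size)
{
    const size_t candidates[] = {32, 64, 128, 256, 512, 1024};
    size_t best = candidates[0];
    cl_ulong best_ns = (cl_ulong)-1;

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
        size_t local = candidates[i];
        if (global_size % local != 0)   /* local size must evenly divide global size */
            continue;

        cl_event ev;
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global_size, &local, 0, NULL, &ev);
        if (err != CL_SUCCESS)          /* e.g. exceeds device or kernel limits */
            continue;
        clWaitForEvents(1, &ev);

        /* Measure device-side execution time of this launch. */
        cl_ulong start, end;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(ev);

        if (end - start < best_ns) {
            best_ns = end - start;
            best = local;
        }
    }
    return best;
}
```

The auto-tuner proposed in the paper avoids most of these trial launches by reasoning about occupancy, coalescing, cache contention, and workload variation, which is why it reaches a near-optimal work-group size in a fraction of the search time.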