Abstract
Tuning the kernel work-group size for GPUs is a challenging problem. In this paper, using the performance counters provided by GPUs, we characterize a large body of OpenCL kernels to identify the performance factors that affect the choice of a good work-group size. Based on this characterization, we find that the most influential performance factors with respect to the work-group size are occupancy, coalesced global memory accesses, cache contention, and variation in the amount of workload in the kernel. By addressing these performance factors one by one, we propose auto-tuning techniques that select the best work-group size and shape for GPU kernels. We demonstrate the effectiveness of our auto-tuner by evaluating it on a set of 54 OpenCL kernels running on three different NVIDIA GPUs and one AMD GPU. On average, the auto-tuner needs no more than 8 percent of the time required by an exhaustive search to find an optimal work-group size, and the execution time with the selected (possibly sub-optimal) work-group size is, on average, at most 1.14x slower than that with the optimal work-group size found by the exhaustive search.
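For context, the sketch below illustrates the exhaustive-search baseline the abstract refers to, not the paper's auto-tuner: it enumerates candidate 1-D work-group (local) sizes and times each kernel launch with OpenCL event profiling, keeping the fastest. The kernel, a command queue created with CL_QUEUE_PROFILING_ENABLE, the global size, and the candidate list are assumed and illustrative.

```c
/* Hedged sketch of an exhaustive search over 1-D work-group sizes.
 * Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE and that
 * `kernel` has its arguments already set; candidate sizes are illustrative. */
#include <CL/cl.h>

size_t pick_best_local_size(cl_command_queue queue, cl_kernel kernel,
                            size_t global_size)
{
    const size_t candidates[] = {32, 64, 128, 256, 512, 1024};
    size_t best = candidates[0];
    cl_ulong best_ns = (cl_ulong)-1;

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
        size_t local = candidates[i];
        if (global_size % local != 0)   /* local size must evenly divide global size */
            continue;

        cl_event ev;
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global_size, &local, 0, NULL, &ev);
        if (err != CL_SUCCESS)          /* e.g. exceeds device or kernel limits */
            continue;
        clWaitForEvents(1, &ev);

        /* Measure device-side execution time of this launch. */
        cl_ulong start, end;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(ev);

        if (end - start < best_ns) {
            best_ns = end - start;
            best = local;
        }
    }
    return best;
}
```

The auto-tuner proposed in the paper avoids most of these trial launches by reasoning about occupancy, coalescing, cache contention, and workload variation, which is why it reaches a near-optimal work-group size in a fraction of the search time.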