A look at the OpenCL 2.0 execution model

Benedict Gaster

doi:10.1145/2791321.2791323

Abstract

A popular approach to programming manycore GPUs is the Single Instruction Multiple Thread (SIMT) abstraction. SIMT has the benefit of presenting a single-thread view, alleviating the complexity of explicitly vectorizing the source code. However, due to the SIMD nature of the underlying hardware, it is often difficult to hide every aspect of that hardware from the developer. One example of such a leak is OpenCL's barrier, which requires all workitems (i.e. threads) in a workgroup to reach and execute the barrier.

But what does it mean to reach and execute the same barrier? OpenCL provides very little information about the underlying semantics. In this talk we explore OpenCL's execution model, both from a programmer's perspective and in terms of the set of valid translations that an optimizing compiler can perform while retaining the intended semantics.

Using a set of examples, some of them surprising, we show that common transformations often performed by traditional scalar compilers are not, in general, valid when applied to OpenCL code containing workgroup (or subgroup) collective operations. Additionally, we introduce a mathematical notion of workgroup and subgroup uniformity and outline an execution model for OpenCL 2.0 that enables these traditional compiler transformations to be applied, even in the presence of collective operations, for a subset of all valid OpenCL programs. The model clearly describes when it is valid, and when it is not, to apply these transformations.

This talk is intended for OpenCL developers and compiler writers alike, providing insight into the often ill-documented OpenCL execution model, its intended design choices, and how different implementers might implement contentious aspects of the specification differently.
