Abstract

The paper describes the approach used to implement OpenMP offloading to NVIDIA accelerators in GCC. Offloading is a new capability in the OpenMP 4.0 specification that allows the programmer to mark regions of code for execution on an accelerator device, which potentially has its own memory space and an architecture tuned towards highly parallel execution. NVIDIA provides a specification of the abstract PTX architecture for low-level, yet portable, programming of their GPU accelerators. PTX code usually does not use explicit vector (SIMD) computation; instead, vector parallelism is expressed via SIMT (single instruction, multiple threads) execution, where groups of 32 threads run in lockstep, with hardware support for divergent branching. However, some control flow constructs, such as spinlock acquisition, can lead to deadlocks, since reconvergence points after branches are inserted implicitly. Our implementation therefore maps logical OpenMP threads to PTX warps (synchronous groups of 32 threads), and individual PTX execution contexts to logical OpenMP SIMD lanes (this is similar to the mapping used in OpenACC). To implement execution of one logical OpenMP thread by a group of PTX threads, we developed a new code generation model that keeps all PTX threads active, mirrors their local state (register contents), and ensures that side effects from atomic instructions and system calls such as malloc happen only once per warp. This is achieved by executing the original atomic or call instruction under a predicate and then propagating the register holding the result using the shuffle exchange (shfl) instruction. Furthermore, the predicate and the source lane index of the shuffle instruction can be set up so that, inside SIMD regions, this sequence has the same effect as the original instruction alone. We also describe our implementation of compiler-defined per-warp stacks, which are required to provide per-warp automatic storage outside of SIMD regions that allows cross-warp references (normally, automatic storage in PTX is implemented via the .local memory space, which is visible only to the PTX thread that owns it). This is motivated by our reuse of unmodified OpenMP lowering in GCC where possible, and thus the use of libgomp routines for entering parallel regions, distributing loop iterations, and so on. We tested our implementation on a set of micro-benchmarks and observed a fixed overhead of about 100 microseconds when entering a target region, mostly due to startup procedures in libgomp (notably calls to malloc); for long-running regions where that overhead is small, we achieve performance similar to analogous OpenACC and CUDA code.
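As an illustration, the following is a minimal CUDA-level sketch of the once-per-warp side-effect pattern described above. The paper's implementation emits the equivalent PTX directly (a predicated call followed by a shfl); warp_malloc here is a hypothetical helper name, not a routine from the paper, and the 1-D thread block indexing is an assumption of the sketch.

    #include <cstdlib>

    /* Hypothetical helper: all 32 lanes of a warp call this together
       and all receive the same pointer, but the side effect (malloc)
       happens only once per warp.  Assumes a 1-D thread block.  */
    __device__ void *warp_malloc (size_t size)
    {
      unsigned long long p = 0;
      if (threadIdx.x % 32 == 0)                 /* predicate: lane 0 only */
        p = (unsigned long long) malloc (size);  /* side effect once       */
      /* Shuffle from lane 0: every lane's copy of the result register
         is overwritten with lane 0's value, so the mirrored register
         state of the warp stays identical.  */
      p = __shfl_sync (0xffffffff, p, 0);
      return (void *) p;
    }

Inside SIMD regions, by contrast, the predicate holds in every lane and each lane names itself as the shuffle source, so the same instruction sequence degenerates into the plain original instruction.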

Highlights

  • The paper describes the approach used in implementing OpenMP offloading to NVIDIA accelerators in GCC

  • Offloading is a new capability in the OpenMP 4.0 specification that allows the programmer to mark regions of code for execution on an accelerator device, which potentially has its own memory space and an architecture tuned towards highly parallel execution

  • Individual PTX execution contexts are mapped to logical OpenMP SIMD lanes


Summary

OpenMP translation in GCC

Support for OpenMP in GCC is split between the compiler proper and the run-time library libgomp, which is shipped with the other components of the compiler. The public functions of libgomp form two namespaces: functions with the omp_ prefix implement the OpenMP API, while functions with the GOMP_ prefix implement the internal interface between the compiler and libgomp. There are exactly three OpenMP pragmas for which the compiler outlines the code of the whole statement under the pragma into a separate function: parallel, task, and target. These pragmas prescribe that the corresponding code may execute not in the current context, but either in several parallel threads (parallel), in any thread (task), or on an accelerator device (target). The functions outlined from user code take an argument of type (struct omp_data_sN *): a pointer to a structure containing pointers to the variables that are declared outside the block but used inside it (for small variables, their values can be passed directly instead of pointers, provided the outer variable is not modified inside the block; this is guaranteed, for example, for firstprivate variables). A sketch of this transformation follows below.
(Figure: outlining an OpenMP region into a separate function, passing the address of a shared variable.)
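A minimal C sketch of this outlining, for a region like "#pragma omp parallel { #pragma omp atomic sum++; }". The GOMP_parallel prototype matches libgomp's internal interface, but main_omp_fn_0 stands in for GCC's real outlined function (named main._omp_fn.0, which is not a legal C identifier) and the structure layout is simplified:

    #include <stdio.h>

    /* Internal compiler/libgomp interface (prototype as in libgomp). */
    extern void GOMP_parallel (void (*fn) (void *), void *data,
                               unsigned num_threads, unsigned flags);

    struct omp_data_s0 { int *sum; };  /* pointer: sum is modified inside */

    /* Outlined body of the parallel region.  */
    static void main_omp_fn_0 (void *data)
    {
      struct omp_data_s0 *d = data;
      __atomic_fetch_add (d->sum, 1, __ATOMIC_RELAXED); /* was: omp atomic */
    }

    int main (void)
    {
      int sum = 0;
      struct omp_data_s0 omp_data = { &sum };
      /* What the compiler emits in place of "#pragma omp parallel": */
      GOMP_parallel (main_omp_fn_0, &omp_data, 0, 0);
      printf ("sum = %d\n", sum);
      return 0;
    }

Built with "gcc file.c -lgomp", this runs the outlined body once per thread of the team, each thread incrementing the shared variable through the pointer in the data structure.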

Secondary stacks for automatic variables in PTX
Executing code outside SIMD regions in synchronous groups
Findings
Testing the implementation on model examples
