Support for the OpenMP 4.0 Standard for the NVIDIA PTX Architecture in the GCC Compiler

A. V. Monakov, V. A. Ivanishin

doi:10.15514/ispras-2016-28(4)-10

Abstract

The paper describes the approach used to implement OpenMP offloading to NVIDIA accelerators in GCC. Offloading is a capability introduced in the OpenMP 4.0 specification that lets the programmer mark regions of code for execution on an accelerator device, which may have its own memory space and an architecture tuned for highly parallel execution. NVIDIA specifies the abstract PTX architecture for low-level yet portable programming of its GPU accelerators. PTX code usually does not use explicit vector (SIMD) computation; instead, vector parallelism is expressed via SIMT (single instruction, multiple threads) execution, in which groups of 32 threads execute in lockstep, with hardware support for divergent branching. However, because reconvergence points after branches are inserted implicitly, some control flow constructs, such as spinlock acquisition, can deadlock. Our implementation therefore maps logical OpenMP threads to PTX warps (synchronous groups of 32 threads) and individual PTX execution contexts to logical OpenMP SIMD lanes (similar to the mapping used in OpenACC). To have one logical OpenMP thread executed by a group of PTX threads, we developed a new code generation model that keeps all PTX threads active, mirrors their local state (register contents), and makes the side effects of atomic instructions and system calls such as malloc happen only once per warp. This is achieved by executing the original atomic or call instruction under a predicate and then propagating the register holding the result with the shuffle exchange (shfl) instruction. Furthermore, the predicate and the source lane index of the shuffle instruction can be set up so that, inside SIMD regions, this sequence has the same effect as the original instruction alone.
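
The predicate-then-shuffle sequence can be illustrated with a small host-side simulation in plain C. This is only a sketch and not PTX or the code GCC emits: the function `warp_atomic_add`, the helper `fetch_add`, the choice of lane 0 as the lane that performs the side effect, and the use of each lane's own index as the shuffle source inside SIMD regions are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32

/* Stand-in for an atomic fetch-and-add: returns the old value. */
static int fetch_add(int *addr, int val)
{
    int old = *addr;
    *addr += val;
    return old;
}

/*
 * Simulates one warp executing "r = fetch_add(counter, 1)".
 * in_simd == false: the whole warp acts as one logical OpenMP thread.
 * in_simd == true:  every lane acts as an independent SIMD lane.
 * result[] models the destination register of each lane.
 */
void warp_atomic_add(int *counter, bool in_simd, int result[WARP_SIZE])
{
    int tmp[WARP_SIZE] = {0};

    /* Step 1: execute the instruction under a predicate.
       Assumption: outside SIMD regions only lane 0 is enabled. */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        bool pred = in_simd || lane == 0;
        if (pred)
            tmp[lane] = fetch_add(counter, 1);
    }

    /* Step 2: shuffle, so that each lane reads the register of a source lane.
       Assumption: source is lane 0 outside SIMD, the lane itself inside. */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        int src = in_simd ? lane : 0;
        assert(src >= 0 && src < WARP_SIZE);
        result[lane] = tmp[src];
    }
}
```

Outside a SIMD region the counter advances once and all 32 simulated registers end up holding the same old value, which keeps the mirrored state consistent. Inside a SIMD region the same two steps degenerate to an ordinary per-lane instruction, because every lane is enabled and shuffles from itself.
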
We also describe our implementation of compiler-defined per-warp stacks, which are required to provide per-warp automatic storage outside of SIMD regions that permits cross-warp references (automatic storage in PTX is normally implemented via the .local memory space, which is visible only to the PTX thread that owns it). This is motivated by our use of unmodified OpenMP lowering in GCC where possible, and thus of libgomp routines for entering parallel regions, distributing loop iterations, and so on. We tested our implementation on a set of micro-benchmarks and observed a fixed overhead of about 100 microseconds when entering a target region, mostly due to startup procedures in libgomp (notably calls to malloc). For long-running regions, where that overhead is small, we achieve performance similar to analogous OpenACC and CUDA code.
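
The idea of per-warp stacks placed in memory that other warps can address can be sketched in host-side C as follows. This is a simplified model under stated assumptions, not the layout used by the GCC implementation: the names `warp_stack_push` and `warp_stack_pop`, the constants `NUM_WARPS` and `STACK_SLOTS`, and the upward-growing stack of `int` slots are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_WARPS   4    /* hypothetical number of warps */
#define STACK_SLOTS 64   /* hypothetical per-warp stack size, in int slots */

/* One arena visible to every warp, unlike thread-private .local storage. */
static int stack_arena[NUM_WARPS][STACK_SLOTS];
/* Per-warp stack pointer: index of the first free slot. */
static size_t stack_top[NUM_WARPS];

/* Reserve automatic storage for the given warp; returns NULL if it is full. */
int *warp_stack_push(int warp, size_t slots)
{
    assert(warp >= 0 && warp < NUM_WARPS);
    if (slots > STACK_SLOTS - stack_top[warp])
        return NULL;
    int *p = &stack_arena[warp][stack_top[warp]];
    stack_top[warp] += slots;
    return p;
}

/* Release the most recently reserved slots, as on function exit. */
void warp_stack_pop(int warp, size_t slots)
{
    assert(warp >= 0 && warp < NUM_WARPS);
    assert(slots <= stack_top[warp]);
    stack_top[warp] -= slots;
}
```

Because the arena is ordinary shared storage in this model, a pointer returned for one warp stays valid when another warp dereferences it, which is the property that cross-warp references to automatic variables need.
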

Summary

OpenMP translation in GCC

OpenMP support in GCC is split between the compiler proper and the libgomp runtime library, which is shipped together with the other compiler components. The public functions of libgomp form two namespaces: functions with the omp_ prefix implement the OpenMP API, while functions with the GOMP_ prefix implement the internal interface between the compiler and libgomp. There are exactly three OpenMP pragmas for which the compiler moves the code of the full statement under the pragma into a separate function: parallel, task, and target. These pragmas specify that the corresponding code may run outside the current context: in several parallel threads (parallel), in any thread (task), or on an accelerator device (target). The functions outlined from user code take an argument of type (struct omp_data_sN *), a pointer to a structure holding pointers to the variables that are declared outside the block but used inside it. For small variables, the values themselves can be passed instead of pointers, provided that the outer variable is not modified inside the block; this is certainly the case for firstprivate variables, for example.

Figure: outlining an OpenMP region into a separate function, passing the address of a shared variable.
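
The outlining scheme described above can be approximated by hand in plain C. The sketch below is illustrative only: the names `omp_data_s0`, `region_fn_0` and `run_region` are hypothetical, and the sequential loop is a stand-in for the GOMP_-prefixed libgomp entry point that would start the thread team in real generated code.

```c
#include <assert.h>

/* Hypothetical compiler-generated record (the N in omp_data_sN is 0 here). */
struct omp_data_s0 {
    int *sum;   /* shared variable: passed by address */
    int  n;     /* firstprivate variable: small and unmodified, passed by value */
};

/* Hypothetical outlined body of:
 *   #pragma omp parallel firstprivate(n) shared(sum)
 *   {
 *     #pragma omp atomic
 *     sum += n;
 *   }
 */
static void region_fn_0(struct omp_data_s0 *data)
{
    /* A real thread team would need an atomic update here. */
    *data->sum += data->n;
}

/* What remains at the original site: fill in the record and pass the
   outlined function to the runtime. The loop below only imitates the
   runtime by calling the function once per simulated thread. */
int run_region(int n, int num_threads)
{
    int sum = 0;
    struct omp_data_s0 data = { &sum, n };
    assert(num_threads >= 0);
    for (int t = 0; t < num_threads; t++)
        region_fn_0(&data);
    return sum;
}
```

With this model, `run_region(3, 4)` accumulates the firstprivate value once per simulated thread through the shared pointer, which mirrors how the outlined function reaches variables of the enclosing scope.
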

Secondary stacks for automatic variables in PTX

Executing code outside SIMD regions in synchronous groups

Findings

Testing the implementation on model examples

Journal: Proceedings of the Institute for System Programming of the RAS
Publication Date: Jan 1, 2016
License type: cc-by
