Abstract

While the data-parallel aspects of OpenCL have been of primary interest because massively data-parallel GPUs have been in focus, OpenCL also provides powerful capabilities for describing task parallelism. In this article we study the task-parallel concepts available in OpenCL and examine how well different vendor-specific implementations exploit task parallelism when it is described in various ways using command queues. We show that the vendor implementations are not yet capable of automatically extracting kernel-level task parallelism from in-order queues. To assess the potential performance benefits of in-order queue parallelization, we implemented such capabilities in an open source implementation of OpenCL. The evaluation was conducted as a case study of an advanced noise reduction algorithm described as a multi-kernel OpenCL application.

Highlights

  • OpenCL is a widely adopted programming standard for parallel heterogeneous systems

  • While the data-parallel aspects of OpenCL have been of primary interest to its users because massively parallel GPU devices have been in focus, OpenCL provides extensive capabilities for describing heterogeneous task parallelism: commands are pushed to one or more command queues controlling one or more devices, and events, command queue barriers, or kernel argument buffer data dependencies are used for synchronization

  • The results suggest that AMD’s SDK does not currently make data-locality-aware scheduling decisions based on command queue dependencies but instead schedules from the command queues “fairly”, which had a severe impact on the platforms in this case study that have more limited cache resources

Summary

Introduction

OpenCL is a widely adopted programming standard for parallel heterogeneous systems. The goal of the standard is to support a wide range of heterogeneous platforms efficiently and to provide source code portability across them. While the data-parallel aspects of OpenCL have been of primary interest to its users because massively parallel GPU devices have been in focus, OpenCL provides extensive capabilities for describing heterogeneous task parallelism: commands are pushed to one or more command queues controlling one or more devices, and events, command queue barriers, or kernel argument buffer data dependencies are used for synchronization. We consider this side of the standard underutilized, even though it is the feature that lets the devices in a heterogeneous platform collaboratively execute multi-kernel applications efficiently by reducing the “master role” of the host program.
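To make the idea of synchronizing through kernel argument buffer data dependencies more concrete, the following is a minimal sketch of how an in-order list of commands could be turned into a task graph. It is not the implementation evaluated in the article and it does not use the OpenCL API; the `Command` class, the buffer names, and the functions `build_task_graph` and `parallel_waves` are hypothetical names introduced only for illustration. The sketch assumes each command declares the buffers it reads and writes, and it treats any read-after-write, write-after-read, or write-after-write overlap with an earlier command as a dependency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Hypothetical stand-in for an enqueued kernel command."""
    name: str
    reads: frozenset = frozenset()   # buffers the kernel only reads
    writes: frozenset = frozenset()  # buffers the kernel writes


def build_task_graph(queue):
    """Map each command index to the earlier indices it must wait for."""
    deps = {i: set() for i in range(len(queue))}
    for j, later in enumerate(queue):
        for i in range(j):
            earlier = queue[i]
            raw = earlier.writes & later.reads    # read after write
            war = earlier.reads & later.writes    # write after read
            waw = earlier.writes & later.writes   # write after write
            if raw or war or waw:
                deps[j].add(i)
    return deps


def parallel_waves(queue, deps):
    """Group command names into waves whose members do not depend on each other."""
    level = {}
    for j in range(len(queue)):
        # A command runs one wave after the latest command it depends on.
        level[j] = 1 + max((level[i] for i in deps[j]), default=-1)
    waves = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for j, lvl in level.items():
        waves[lvl].append(queue[j].name)
    return waves


# Illustrative in-order queue: two independent kernels feeding a third.
queue = [
    Command("blur_a", reads=frozenset({"in_a"}), writes=frozenset({"tmp_a"})),
    Command("blur_b", reads=frozenset({"in_b"}), writes=frozenset({"tmp_b"})),
    Command("combine", reads=frozenset({"tmp_a", "tmp_b"}), writes=frozenset({"out"})),
]
deps = build_task_graph(queue)
waves = parallel_waves(queue, deps)
print(waves)
```

In this example the two `blur` commands touch disjoint buffers, so they land in the same wave even though they were enqueued in order, while `combine` must wait for both. A real runtime would also have to account for events, barriers, and sub-buffer overlap, which this sketch leaves out.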

Platform-Wide Execution of Heterogeneous Task Graphs
Task Parallel Concepts in OpenCL
Converting Command Queues to Task Graphs
Constructing the Task Graph
Command Queue Data Dependence Analysis
Implementing a Task Scheduling Runtime
Dynamic Construction of Task Graphs
Dynamic Task Scheduling for Shared Memory Multicores
The Application
Tested Runtimes
Related Work
Conclusions
12. Movidius
Findings
15. Texas Instruments