Kernel Launch Research Articles

SUMMARY: General purpose computation on graphics processing units (GPUs) is rapidly entering various scientific and engineering fields. Many applications are being ported to GPUs for better performance, and various optimizations, frameworks, and tools are being developed for effective GPU programming. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalesce that further enhances GPU performance and also optimizes CPU to GPU communication time. With the kernel coalesce methods proposed in this paper, kernel launch overheads are reduced by coalescing concurrent kernels, and data transfers are reduced when intermediate data is generated and used among kernels. Computation on the device (GPU) is optimized by tuning the number of blocks and threads launched to the architecture. The block level kernel coalesce method yielded a prominent performance improvement on a device without support for concurrent kernels. The thread level kernel coalesce method is better than the block level kernel coalesce method when the grid structure (i.e., the number of blocks and threads) is not optimal for the device architecture, which leads to underutilization of device resources. Both methods perform similarly when the number of threads per block is approximately the same in the different kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The thread multi-clock cycle coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have lightweight thread computations, the multi-clock cycle kernel coalesce method gives better performance than the thread and block level kernel coalesce methods.
If the kernels to be coalesced are a combination of compute-intensive and memory-intensive kernels, warp interleaving gives higher device occupancy and improves performance. For micro-benchmark1 considered in this paper, the multi-clock cycle kernel coalesce method resulted in a 10–40% improvement over separate kernel launches without shared input and intermediate data among the kernels, and an 80–92% improvement with such sharing, on a Fermi architecture device (GTX 470). A nearest neighbor (NN) kernel from the Rodinia benchmark was coalesced with itself using the thread level kernel coalesce method and warp interleaving, giving 131.9% and 152.3% improvement, respectively, compared with separate kernel launches, and 39.5% and 36.8% improvement, respectively, compared with the block level kernel coalesce method. Copyright © 2013 John Wiley & Sons, Ltd.
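
The abstract does not show how kernels are merged. One common reading of block level coalescing is that a single launch covers the blocks of both kernels and the block index selects which original body runs. The following is a minimal CPU-side sketch of that idea in plain Python, not the paper's implementation: `kernel_a`, `kernel_b`, `coalesced_kernel`, `launch`, and `run_coalesced` are all hypothetical names, and the serial loop only stands in for a GPU launch.

```python
# Plain-Python model of block level kernel coalescing (no GPU required).
# Every name here is hypothetical and chosen for illustration only.

def kernel_a(block_idx, thread_idx, block_dim, data):
    """Stand-in for the first kernel: doubles one element."""
    i = block_idx * block_dim + thread_idx
    if i < len(data):
        data[i] *= 2


def kernel_b(block_idx, thread_idx, block_dim, data):
    """Stand-in for the second kernel: increments one element."""
    i = block_idx * block_dim + thread_idx
    if i < len(data):
        data[i] += 1


def coalesced_kernel(block_idx, thread_idx, block_dim, blocks_a, data_a, data_b):
    """Merged kernel: the block index decides which original body runs."""
    if block_idx < blocks_a:
        kernel_a(block_idx, thread_idx, block_dim, data_a)
    else:
        # Re-base the block index so kernel_b sees blocks starting at 0.
        kernel_b(block_idx - blocks_a, thread_idx, block_dim, data_b)


def launch(kernel, grid_dim, block_dim, *args):
    """Serial stand-in for a kernel launch over every (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def blocks_needed(n, block_dim):
    """Ceiling division: blocks required to cover n elements."""
    return (n + block_dim - 1) // block_dim


def run_coalesced(data_a, data_b, block_dim=4):
    """Run both kernels with one launch instead of two."""
    blocks_a = blocks_needed(len(data_a), block_dim)
    blocks_b = blocks_needed(len(data_b), block_dim)
    launch(coalesced_kernel, blocks_a + blocks_b, block_dim,
           blocks_a, data_a, data_b)
    return data_a, data_b
```

In this model the two original launches become one, which is where the saving in launch overhead would come from on a real device. A thread level variant would instead branch on the thread index inside each block.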


Purpose: Algebraic reconstruction technique (ART) type algorithms produce superior image quality for CBCT and CT reconstructions compared with the popular filtered-back-projection based approaches, but are too slow for real-time clinical applications. The purpose of this study is to employ the emerging OpenCL architecture to accelerate simultaneous ART (SART) by parallelizing the most time-consuming forward- and back-projections on a General-Purpose Graphics Processing Unit (GPGPU). Methods: For each iteration, SART sequentially performs three ray-driven projections (one forward- and two back-projections) for each acquired projection image. To accelerate SART reconstruction, both the forward-projection and back-projection kernels were scheduled on the GPGPU using data parallelism to take full advantage of its compute units. The single-work-item-for-single-ray technique was employed as the parallelization mechanism. We conducted numerical experiments to test the OpenCL-based implementation on a Dell Precision T7500 workstation with two quad-core CPUs and one Nvidia Tesla C2050 GPGPU. Poly-energetic projection data (512×512) for the Mohan 4 MV energy spectrum were simulated at each degree for 360 gantry angles for a head-and-neck digital phantom and were fed into the SART algorithm for CBCT reconstruction of a 256×256×256 volume. To accelerate the poly-energetic projection computation, we partitioned the workloads using task parallelism and data parallelism and scheduled them in a parallel computing ecosystem consisting of CPU and GPGPU using OpenCL only. Results: The GPGPU computation time, including the kernel launch time, kernel running time, and data transfer time, was 42 ms for forward-projection and 95 ms for back-projection. Each SART iteration took 101 s on the GPGPU compared with 7195 s on a single-threaded CPU. The proposed method achieved an approximately 71-fold speedup.
The relative difference between the images reconstructed by the CPU-based and OpenCL/GPGPU-based implementations was on the order of 0.00001, making them virtually indistinguishable. Conclusions: We have successfully implemented the SART algorithm on GPGPU using OpenCL and significantly reduced the reconstruction time to a level that is almost suitable for real-time clinical applications.
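
The reported speedup follows directly from the two per-iteration timings in the abstract. As a small worked check, with the helper name `speedup` being an assumption made only for this sketch:

```python
# Per-iteration SART timings quoted in the abstract, in seconds.
CPU_SECONDS = 7195.0   # single-threaded CPU
GPU_SECONDS = 101.0    # OpenCL on the GPGPU


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline time to accelerated time."""
    return baseline_seconds / accelerated_seconds


if __name__ == "__main__":
    # 7195 / 101 is about 71.2, consistent with the reported ~71x figure.
    print(f"{speedup(CPU_SECONDS, GPU_SECONDS):.1f}x")
```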



Articles published on Kernel Launch

MrBayes sMC3

Accelerating sparse Cholesky factorization on GPUs

SU-G-TeP1-15: Toward a Novel GPU Accelerated Deterministic Solution to the Linear Boltzmann Transport Equation

Dynamic thread block launch

Accelerating DynEarthSol3D on tightly coupled CPU–GPU heterogeneous processors

On optimizing machine learning workloads via kernel fusion

Solving seven-equation model for compressible two-phase flow using multiple GPUs

Implementation of a Thread-Parallel, GPU-Friendly Function Evaluation Library

Communication and computation optimization of concurrent kernels using kernel coalesce on a GPU

TH‐C‐BRA‐03: Fast Iterative Cone Beam CT Reconstruction on GPGPU Using OpenCL

CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis

Scalable multi-GPU implementation of the MAGFLOW simulator

HIGH PRECISION INTEGER ADDITION, SUBTRACTION AND MULTIPLICATION WITH A GRAPHICS PROCESSING UNIT
