Kernel Calls Research Articles

In this paper we describe and demonstrate a C++ code written to determine the trajectory of particles traversing oriented single crystals, and a CUDA code written to evaluate the radiation spectra from charged particles with arbitrary trajectories. The CUDA/C++ code can evaluate both classical and quantum mechanical radiation spectra for spin 0 and spin 1/2 particles. We include multiple Coulomb scattering and energy loss due to radiation emission, which yields radiation spectra in agreement with experimental spectra for both positrons and electrons. We also demonstrate how GPUs can be used to speed up calculations by several orders of magnitude. This allows research groups with limited funding or limited access to supercomputers to perform numerical calculations at a scale that would otherwise require one. We show that one Titan V GPU can replace up to 100 36-core Xeon CPUs running in parallel. We also show that the choice of GPU for a specific job has a large impact on performance, as some GPUs have better double precision performance than others.

Program summary
Program Title: Radiation From Charged Particles Penetrating Oriented Crystals
Program Files doi: http://dx.doi.org/10.17632/zp9gskrbvg.1
Licensing provisions: MIT license
Programming language: C++ and CUDA
Nature of problem: Calculating the radiation spectrum emitted from charged particles penetrating oriented single crystals. Exact analytical solutions are not possible, but Monte Carlo simulations give the closest agreement with experiments. The problem is particularly difficult because of the number of integrals that must be evaluated to obtain the entire radiation spectrum.
Solution method: By moving the evaluation of each radiation integral to a thread on a GPU, we are able to parallelize the problem massively and thereby decrease computation times by several orders of magnitude. Each thread in a kernel call to the GPU handles one integral. Because kernel calls can be queued, and the time to evaluate the trajectory of a particle is relatively long, each thread on the CPU evaluates its own trajectory and calls a kernel to evaluate the radiation from that specific particle. In this way we minimize the downtime of the GPU, as there will always be a few CPU threads with a kernel call ready to be evaluated on the GPU.
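
The queuing pattern described in the solution method can be illustrated with a CPU-only C++ sketch. This is not the authors' code: the GPU is replaced by a single consumer thread, the physics is a placeholder, and all names (LaunchQueue, KernelCall, compute_trajectory, radiation_integral, simulate) are hypothetical. It only shows how several CPU threads, each computing its own trajectory, can keep a queue of pending "kernel calls" filled so that the device rarely idles.

```cpp
#include <cassert>
#include <cmath>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for one queued kernel call: one particle's trajectory.
struct KernelCall {
    int particle_id;
    std::vector<double> trajectory;
};

// Hypothetical stand-in for the device launch queue (thread-safe FIFO).
class LaunchQueue {
public:
    void push(KernelCall call) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(call)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until a call is available; returns false once closed and drained.
    bool pop(KernelCall& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<KernelCall> q_;
    bool closed_ = false;
};

// Placeholder for the (relatively slow) trajectory calculation on the CPU.
std::vector<double> compute_trajectory(int particle_id, std::size_t n_steps) {
    std::vector<double> x(n_steps);
    for (std::size_t i = 0; i < n_steps; ++i)
        x[i] = std::sin(0.01 * static_cast<double>(i) * (particle_id + 1));
    return x;
}

// Placeholder for one radiation integral; on a GPU this is one thread's work.
double radiation_integral(const std::vector<double>& x, double omega) {
    assert(!x.empty());
    double sum = 0.0;
    for (double xi : x) sum += std::cos(omega * xi);
    return sum / static_cast<double>(x.size());
}

// CPU threads produce trajectories; one "device" thread consumes kernel calls.
std::vector<double> simulate(int n_particles, int n_cpu_threads,
                             std::size_t n_steps, std::size_t n_bins) {
    LaunchQueue queue;
    std::vector<double> spectrum(n_bins, 0.0);

    std::thread device([&] {
        KernelCall call;
        while (queue.pop(call))
            for (std::size_t k = 0; k < n_bins; ++k)  // on a GPU: one thread per k
                spectrum[k] += radiation_integral(call.trajectory,
                                                  1.0 + static_cast<double>(k));
    });

    std::vector<std::thread> workers;
    for (int t = 0; t < n_cpu_threads; ++t)
        workers.emplace_back([&, t] {
            for (int p = t; p < n_particles; p += n_cpu_threads)
                queue.push({p, compute_trajectory(p, n_steps)});
        });

    for (auto& w : workers) w.join();
    queue.close();   // no more kernel calls will arrive
    device.join();   // wait for the queue to drain
    return spectrum;
}
```

Calling, for example, simulate(8, 4, 200, 16) accumulates a 16-bin placeholder spectrum from 8 particles using 4 producer threads. In a real CUDA program the consumer thread would correspond to the GPU working through queued kernel launches, and the inner loop over k to the threads of one kernel call.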

Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high performance computing domains. Programming standards such as OpenCL are used to leverage the inherent parallelism offered by GPUs. Source code optimizations such as loop unrolling and tiling, when applied to heterogeneous applications, have been reported to yield large gains in performance. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used on heterogeneous applications. We analyze the power/performance trade-off by evaluating the energy consumption of the optimizations. We compare the performance of different optimization techniques on four different fast Fourier transform (FFT) implementations. Our study covers discrete GPUs, shared memory GPUs (APUs) and low power system-on-chip (SoC) devices, and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler), Intel (Ivy Bridge) and Qualcomm (Snapdragon S4) as test platforms. The study identifies the architectural and algorithmic factors that most affect power consumption. We explore a range of application optimizations which increase power consumption by 27% but result in a speedup of more than 1.8×. We observe up to an 18% reduction in power consumption due to reduced kernel calls across FFT implementations. We also observe an 11% variation in energy consumption among different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application, but also impact its power efficiency. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can differ vastly in performance depending on the target hardware and the associated application design.
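
As a rough worked example of the power/performance trade-off, assume that the 27% power increase and the 1.8× speedup quoted above apply to the same optimization and that power is averaged over the run (the abstract states neither). Energy is average power times execution time, so the relative energy of the optimized version would be:

```latex
E = P\,t, \qquad
\frac{E_{\mathrm{opt}}}{E_{\mathrm{base}}}
  = \frac{P_{\mathrm{opt}}}{P_{\mathrm{base}}} \cdot \frac{t_{\mathrm{opt}}}{t_{\mathrm{base}}}
  = \frac{1.27}{1.8} \approx 0.71
```

Under these assumptions the optimized version would consume roughly 29% less energy even though it draws more power, which is why power and energy have to be evaluated separately.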

Articles published on Kernel Calls

Efficient parallel implementations to compute the diameter of a graph

GPU accelerated simulation of channeling radiation of relativistic particles

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Bulk execution of the dynamic programming for the optimal polygon triangulation problem on the GPU

Android vs. SEAndroid: An empirical assessment

Efficient Implementation of IPCP and DFP

Scalable Partitioning for Parallel Position Based Dynamics

Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms

A CUDA implementation of the Continuous Space Language Model

A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors

Pseudo-random number generation for Brownian Dynamics and Dissipative Particle Dynamics simulations on GPU devices

ISIPC: Instant Synchronous Interprocess Communication

APCFS: Autonomous and Parallel Compressed File System

ARTK‐M2: A kernel for Ada tasking requirements: An implementation and an automatic generator

Threads and input/output in the synthesis kernal

The Sprite network operating system

A New Security Testing Method and Its Application to the Secure Xenix Kernel

The Synthesis Kernel

Measurement of cryptographic capability protection algorithms
