Abstract

GPUs provide a large number of registers to each thread, and making good use of this resource is important for achieving high performance. With more registers available, more of the data associated with each thread can be kept in registers. However, as the number of registers used per thread increases, the number of active threads on the GPU decreases, which reduces GPU occupancy and hence degrades performance. It is therefore necessary to study the impact of register pressure on the performance of applications running on the GPU.

The poster presents a framework that estimates per-thread register pressure for a given CUDA program and implements the thread coarsening transformation. To analyze the effect of register pressure on GPU code, the poster includes experimental results which confirm that changes in the register pressure of a given kernel affect memory accesses on the GPU.

Both the PTX analyzer and the CUDA profiler can provide fairly accurate per-thread register utilization information. Nevertheless, because we apply the code transformations to CUDA source, we require a method to estimate register pressure at the source level. To this end, we have designed a register estimation model. The register counts predicted by our model are fairly comparable to the actual register counts obtained from the CUDA profiler. This estimate can be used to guide high-level code transformations, such as loop fusion and unroll-and-jam, for increased effectiveness.

We also describe an automatic approach for controlling thread granularity in GPU kernels by applying the thread coarsening transformation. We present an overview of a code restructuring and tuning framework for CUDA kernels that implements thread coarsening. The framework leverages existing tools, including NVCC for compiling CUDA kernels and the CUDA profiler for collecting performance metrics. The experiments are performed on CUDA SDK examples. We have also created a synthetic micro-benchmark that shows the impact of the coarsening transformation under ideal circumstances. The experimental results for thread coarsening show increased overall performance for kernels that exhibit inter-thread data locality: for these kernels, the gains from improved register reuse and reduced memory traffic outweigh the cost of lower occupancy.
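
The trade-off described above can be illustrated with a small host-side Python sketch. It is not the poster's estimation model or framework. The function names are hypothetical. The hardware limits (65536 registers and 2048 resident threads per multiprocessor) are assumed values that vary by GPU architecture. The 3-point stencil is an assumed example kernel, chosen because neighbouring threads read overlapping inputs. Each loop iteration stands in for one GPU thread, and local variables stand in for registers.

```python
def estimated_occupancy(regs_per_thread, threads_per_block,
                        regs_per_sm=65536, max_threads_per_sm=2048):
    """Rough occupancy estimate limited only by registers and thread slots.

    Assumption: ignores shared memory, per-SM block limits, and register
    allocation granularity, all of which affect real hardware.
    """
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_threads = max_threads_per_sm // threads_per_block
    resident_blocks = min(blocks_by_regs, blocks_by_threads)
    return resident_blocks * threads_per_block / max_threads_per_sm


def stencil_baseline(data):
    """One output per modeled thread; every thread loads its 3 inputs."""
    out, loads = [], 0
    for tid in range(1, len(data) - 1):      # one iteration = one thread
        out.append(data[tid - 1] + data[tid] + data[tid + 1])
        loads += 3
    return out, loads


def stencil_coarsened(data, factor):
    """Each modeled thread computes `factor` adjacent outputs.

    A sliding window (left, mid, right) is kept in local variables, so
    inputs shared by neighbouring outputs are loaded only once per thread.
    """
    n = len(data)
    out, loads = [], 0
    for start in range(1, n - 1, factor):    # one iteration = one coarsened thread
        end = min(start + factor, n - 1)
        left, mid = data[start - 1], data[start]
        loads += 2
        for i in range(start, end):
            right = data[i + 1]
            loads += 1
            out.append(left + mid + right)
            left, mid = mid, right           # reuse values already loaded
    return out, loads
```

For a 10-element input, `stencil_coarsened(data, 4)` produces the same outputs as `stencil_baseline(data)` with 12 modeled loads instead of 24. If coarsening were to raise register usage from 32 to 64 registers per thread (a made-up figure) with 256-thread blocks, `estimated_occupancy` would fall from 1.0 to 0.5 under the assumed limits. Whether the reduced memory traffic pays for that loss depends on the kernel.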

