Abstract

Due to the GPU's complex memory system and massive thread-level parallelism, application programmers often have difficulty optimizing GPU programs. An essential approach to memory optimization is to use low-latency on-chip memory to avoid the high latency of off-chip memory accesses. Shared memory is an on-chip memory that is explicitly managed by programmers. Shared memory has a read/write latency similar to that of the L1 cache, but poor data management can degrade performance. In this paper, we present a static code transformation that preloads datasets into the GPU's shared memory. Our static analysis primarily targets global memory requests with high thread density as candidates for preloading into shared memory. A thread-dense memory access pattern is one in which many threads in a thread block reuse the same data, so that the data can be managed efficiently within the limited address space of shared memory. When selecting datasets for preloading, we limit shared memory usage so that thread-level parallelism remains at the same level. Finally, our source-to-source compiler preloads the selected datasets into shared memory by transforming non-optimized GPU kernel code. Our method achieves 1.26× and 1.62× speedups on average (geometric mean) on GTX 980 and P100 GPUs, respectively.
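The paper's transformation is applied automatically by its source-to-source compiler; as a rough illustration of the before/after shape of such a preloading transformation, consider the following hand-written CUDA sketch. The kernels, the FILTER_SIZE constant, and the input-padding assumption are hypothetical examples, not taken from the paper.

#include <cuda_runtime.h>

#define FILTER_SIZE 64

// Before: every thread in the block re-reads the same filter
// coefficients from global memory (a thread-dense access pattern).
// Assumes `in` is padded with FILTER_SIZE - 1 trailing elements.
__global__ void conv1d_naive(const float* in, const float* filter,
                             float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < FILTER_SIZE; ++k)
        acc += in[i + k] * filter[k];   // same filter[k] loaded by all threads
    out[i] = acc;
}

// After: the dataset selected for preloading (the filter) is staged
// once per thread block into shared memory and reused from there.
__global__ void conv1d_preload(const float* in, const float* filter,
                               float* out, int n) {
    __shared__ float s_filter[FILTER_SIZE];
    // Cooperative preload: threads stride over the dataset once.
    for (int k = threadIdx.x; k < FILTER_SIZE; k += blockDim.x)
        s_filter[k] = filter[k];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < FILTER_SIZE; ++k)
        acc += in[i + k] * s_filter[k]; // low-latency on-chip reads
    out[i] = acc;
}

Note that the shared array is deliberately small (here 256 bytes per block); keeping the preloaded footprint small is what lets occupancy, and hence thread-level parallelism, stay at the same level, as the abstract describes.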
