Abstract

Modern graphics processing units (GPUs) not only perform graphics rendering but also support general-purpose parallel computation (GPGPU). Prior work has shown that the GPU L1 data cache and the on-chip interconnect bandwidth are important sources of performance bottlenecks and inefficiencies in GPGPU workloads. In this work, we aim to understand these sources of inefficiency and the opportunities they present for more efficient cache and interconnect bandwidth management on GPUs. We do so by studying how predictable the reuse behavior and spatial utilization of cache lines are, using program-level information such as the instruction PC and runtime behavior such as the degree of memory divergence. Our characterization results demonstrate that a) the PC and memory divergence can be used to efficiently bypass zero-reuse cache lines around the cache, and b) memory divergence information can further be used to dynamically insert cache lines at varying size granularities based on their spatial utilization. Finally, based on the insights from our characterization, we design a simple Instruction and memory Divergence cache management method that achieves an average performance improvement of 71% across a wide variety of cache- and interconnect-sensitive applications.
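
The bypass mechanism summarized above can be pictured as a small prediction table consulted on each L1D fill. Below is a minimal C++ sketch, assuming a PC- and divergence-indexed table of 2-bit saturating counters trained on eviction; all names (BypassPredictor, kTableSize, fill_bytes) and the specific hash, thresholds, and sector sizes are illustrative assumptions, not details taken from the paper.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a PC- and memory-divergence-based L1D bypass
// predictor: zero-reuse fills are bypassed, and insertion granularity is
// chosen from the warp's divergence degree.
class BypassPredictor {
    static constexpr std::size_t kTableSize = 1024;
    std::array<std::uint8_t, kTableSize> counters_{};  // 2-bit saturating counters

    // Hash the load/store PC with the warp's memory-divergence degree
    // (e.g., the number of distinct cache lines touched by one warp access).
    static std::size_t index(std::uint64_t pc, unsigned divergence) {
        return (pc ^ (static_cast<std::uint64_t>(divergence) * 0x9E3779B9u)) % kTableSize;
    }

public:
    // Predict whether an incoming fill should bypass the L1 data cache.
    bool should_bypass(std::uint64_t pc, unsigned divergence) const {
        return counters_[index(pc, divergence)] >= 2;  // predicted zero-reuse
    }

    // Train on eviction: lines evicted without reuse push the counter toward
    // "bypass"; lines that saw reuse push it back toward "insert".
    void train(std::uint64_t pc, unsigned divergence, bool reused) {
        std::uint8_t& c = counters_[index(pc, divergence)];
        if (!reused && c < 3) ++c;
        else if (reused && c > 0) --c;
    }

    // Choose an insertion granularity from divergence: highly divergent warps
    // use few words per line, so fetch a smaller sector to save bandwidth.
    static unsigned fill_bytes(unsigned divergence) {
        return divergence >= 16 ? 32 : 128;  // quarter-sector vs. full line
    }
};
```

In a simulator, the cache controller would call should_bypass() and fill_bytes() when a miss returns from the interconnect, and train() when a line is evicted, so the table continuously adapts to each static instruction's observed reuse.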
