Abstract

Contemporary general-purpose graphics processing units (GPGPUs) successfully parallelize an application into thousands of concurrent threads with remarkably improved performance. These massive numbers of threads compete for the small first-level data (L1D) cache, leading to a severe cache-thrashing problem that can degrade overall performance significantly. In this paper, we propose a selective victim cache design that enables better data locality and higher performance. Instead of a small fully associative structure, we first redesign the victim cache as a set-associative structure equivalent to the original L1D cache, to suit GPGPU applications with massive concurrent threads. To keep the most frequently used data in the L1D for better operand service, we apply a simple prediction scheme that avoids costly block interchanges and evictions. To further save data-storage area, we propose to leverage unallocated registers and shared-memory entries to hold the victim cache data. The experiments demonstrate that our proposed approach can increase the on-chip data cache hit rate considerably and deliver better performance with negligible changes to the baseline GPGPU architecture. For example, our selective victim cache design improves performance by 41.3% on average, achieving a 54.7% increase in data cache hit rate and a 21.8% reduction in block interchanges and evictions.
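As a rough illustration of the idea, the C++ sketch below models the lookup path of such a selective victim cache. It assumes a victim cache with the same set/way geometry as the L1D and a single reuse-prediction bit per block; all names, sizes, and the victim-selection policy (way 0 standing in for a real LRU choice) are hypothetical illustrations, not the paper's actual implementation.

#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

constexpr std::size_t SETS  = 32;   // hypothetical: mirrors the L1D set count
constexpr std::size_t WAYS  = 4;    // hypothetical associativity
constexpr std::size_t BLOCK = 128;  // hypothetical block size in bytes

struct CacheBlock {
    std::uint64_t tag = 0;
    bool valid = false;
    bool hot   = false;  // 1-bit reuse prediction: promote only when set
};

using CacheSet = std::array<CacheBlock, WAYS>;
using Cache    = std::array<CacheSet, SETS>;

// Look up addr in the L1D first, then in the victim cache. A victim hit
// interchanges the block back into the L1D only when the predictor marks
// it as likely to be reused; otherwise the request is serviced from the
// victim cache in place, avoiding a costly block interchange.
bool access(Cache& l1d, Cache& victim, std::uint64_t addr) {
    std::size_t   set = (addr / BLOCK) % SETS;
    std::uint64_t tag = addr / BLOCK / SETS;

    for (CacheBlock& b : l1d[set])
        if (b.valid && b.tag == tag) { b.hot = true; return true; }  // L1D hit

    for (CacheBlock& b : victim[set]) {
        if (b.valid && b.tag == tag) {
            if (b.hot)
                std::swap(b, l1d[set][0]);  // selective interchange with an L1D way
            return true;  // hit serviced from the victim cache
        }
    }
    return false;  // miss in both levels: fetch from L2/memory (not shown)
}

The point of the selective interchange is that a victim hit on a block the predictor considers cold is serviced without swapping, so the L1D keeps its resident blocks and the swap traffic counted as "block interchanges and evictions" above is reduced.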
