Cooperative Caching for GPUs

Saumay Dublish, Nigel Topham, Vijay Nagarajan

doi:10.1145/3001589

Abstract

The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad factors, such as cache thrashing and extensive multithreading. Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in the L2 access path therefore results in high memory access latencies. In memory-intensive applications, these latencies get exposed due to a lack of active compute threads to mask such high latencies. In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by an average of 29%, freeing up the bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. In doing so, the CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement with our proposed CCN compares favorably to the performance improvement achieved by simply doubling the number of L2 banks by up to 34%.
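
The idea of serving L1 misses from peer L1 caches over a ring can be illustrated with a toy model. The sketch below is not the authors' design or simulator: the class `L1Cache`, the function `simulate`, the fully associative LRU policy, the one-directional ring walk, and the synthetic trace are all assumptions made for illustration. It only counts how many requests reach L2 with and without cooperation.

```python
from collections import OrderedDict


class L1Cache:
    """Hypothetical private L1: fully associative, LRU, tracks line addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, addr):
        """Local access: returns True on a hit and refreshes the LRU position."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def contains(self, addr):
        """Remote probe from the ring: does not disturb the LRU order."""
        return addr in self.lines

    def fill(self, addr):
        """Install a line, evicting the least recently used one if full."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


def simulate(trace, num_cores, capacity, cooperative):
    """Replay (core, addr) accesses and count where each request is served."""
    caches = [L1Cache(capacity) for _ in range(num_cores)]
    stats = {"l1_hits": 0, "remote_hits": 0, "l2_accesses": 0}
    for core, addr in trace:
        if caches[core].lookup(addr):
            stats["l1_hits"] += 1
            continue
        served_remotely = False
        if cooperative:
            # Assumed policy: walk the ring in one direction and probe
            # every other L1 once before falling back to L2.
            for hop in range(1, num_cores):
                neighbor = (core + hop) % num_cores
                if caches[neighbor].contains(addr):
                    served_remotely = True
                    break
        if served_remotely:
            stats["remote_hits"] += 1
        else:
            stats["l2_accesses"] += 1
        caches[core].fill(addr)
    return stats


# Synthetic trace (an assumption): four cores each read the same eight lines.
trace = [(core, addr) for core in range(4) for addr in range(8)]
baseline = simulate(trace, num_cores=4, capacity=16, cooperative=False)
coop = simulate(trace, num_cores=4, capacity=16, cooperative=True)
print("L2 accesses without cooperation:", baseline["l2_accesses"])
print("L2 accesses with cooperation:   ", coop["l2_accesses"])
```

In this deliberately favorable trace, only the first core's misses reach L2 when cooperation is enabled; the other cores find the data in a peer L1. The model ignores ring latency, contention, and coherence, so it says nothing about the performance figures reported in the abstract.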

Journal: ACM Transactions on Architecture and Code Optimization
Publication Date: Dec 12, 2016
Citations: 26

Similar Papers

A Tool to Detect Performance Problems of Multi-threaded Programs on NUMA Systems
Liang Zhu ... Hai Jin
01 Aug 2016

Exploring Cache Size and Core Count Tradeoffs in Systems with Reduced Memory Access Latency
Paulo C Santos ... Matthias Diener
01 Feb 2016

Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads
Prerna Budhkar ... Skyler Windh
ACM Transactions on Architecture and Code Optimization | VOL. 16
09 Apr 2019

A reusability-aware cache memory sharing technique for high-performance low-power CMPs with private L2 caches
Sungjune Youn ... Hyunhee Kim
27 Aug 2007
