Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Rahul Bera, Mohammad Sadrosadati, David Novo, Shankar Balachandran, Onur Mutlu, Konstantinos Kanellopoulos, Ataberk Olgun

doi:10.1109/micro56248.2022.00015

Abstract

Long-latency load requests continue to limit the performance of modern high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: (1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and (2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy to solely determine that it needs to go off-chip. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: (1) accurately predict which load requests might go off-chip, and (2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters, byte offset of a load request). For every load request generated by the processor, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative load request directly to the main memory controller once the load’s physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative load request to finish, and thus Hermes completely hides the on-chip cache hierarchy access latency from the critical path of the correctly-predicted off-chip load. 
Our extensive evaluation using a wide range of workloads shows that Hermes provides consistent performance improvement on top of a state-of-the-art baseline system across a wide range of configurations with varying core count, main memory bandwidth, high-performance data prefetchers, and on-chip cache hierarchy access latencies, while incurring only modest storage overhead. The source code of Hermes is freely available at: https://github.com/CMU-SAFARI/Hermes.
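
The prediction step described above can be illustrated with a minimal sketch of a hashed-perceptron-style predictor. This is not the authors' implementation (see the linked repository for that). The class name, table size, weight range, thresholds, hash functions, and the exact feature combination are all assumptions chosen for illustration; only the general idea (several program features, such as a sequence of program counters and the byte offset of the load, each contributing a learned weight to an off-chip decision) comes from the abstract. The small latency helper is likewise a toy model of overlapping the main memory access with the cache hierarchy lookup.

```python
from collections import deque


class OffChipPredictor:
    """Illustrative hashed-perceptron sketch: one weight table per feature."""

    def __init__(self, table_size=1024, weight_limit=15,
                 activation_threshold=2, training_threshold=8, history_len=4):
        # Every size and threshold here is a hypothetical value.
        self.table_size = table_size
        self.weight_limit = weight_limit
        self.activation_threshold = activation_threshold
        self.training_threshold = training_threshold
        self.pc_history = deque(maxlen=history_len)  # recent load PCs
        self.tables = [[0] * table_size for _ in range(3)]

    def _fold(self, value):
        # Simple XOR-folding hash into a table index (assumed, not from the paper).
        return (value ^ (value >> 10) ^ (value >> 20)) % self.table_size

    def _indices(self, pc, address):
        byte_offset = address & 0x3F  # offset within an assumed 64-byte line
        pc_sequence = 0
        for old_pc in self.pc_history:
            pc_sequence = (pc_sequence * 31 + old_pc) & 0xFFFFFFFF
        features = (pc, pc_sequence, (pc << 6) | byte_offset)
        return [self._fold(f) for f in features]

    def predict(self, pc, address):
        """Return (predicted_off_chip, weight_sum, indices used)."""
        indices = self._indices(pc, address)
        total = sum(table[i] for table, i in zip(self.tables, indices))
        return total >= self.activation_threshold, total, indices

    def train(self, pc, indices, total, went_off_chip):
        """Update weights once the true outcome of the load is known."""
        predicted = total >= self.activation_threshold
        # Train on a misprediction or when the sum is not yet confident.
        if predicted != went_off_chip or abs(total) < self.training_threshold:
            step = 1 if went_off_chip else -1
            for table, i in zip(self.tables, indices):
                # Saturating update keeps each weight within a small range.
                table[i] = max(-self.weight_limit,
                               min(self.weight_limit, table[i] + step))
        self.pc_history.append(pc)


def off_chip_load_latency(cache_latency, memory_latency, predicted_off_chip):
    """Toy model: serial lookup then memory access, or overlapped if predicted."""
    if predicted_off_chip:
        return max(cache_latency, memory_latency)
    return cache_latency + memory_latency


if __name__ == "__main__":
    predictor = OffChipPredictor()
    # Hypothetical trace: one load PC always misses on-chip, another always hits.
    for i in range(50):
        for pc, addr, off_chip in ((0x400100, 0x1000 + 64 * i, True),
                                   (0x400200, 0x2008, False)):
            _, total, indices = predictor.predict(pc, addr)
            predictor.train(pc, indices, total, off_chip)
    print(predictor.predict(0x400100, 0x1000)[0])
    print(predictor.predict(0x400200, 0x2008)[0])
    print(off_chip_load_latency(40, 150, False),
          off_chip_load_latency(40, 150, True))
```

With the assumed latencies of 40 cycles for the cache hierarchy and 150 cycles for main memory, the toy model gives 190 cycles for a serialized off-chip load and 150 cycles when the memory access is overlapped with the cache lookup. A real design must also account for mispredictions and the extra main memory traffic they generate, which this sketch ignores.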
