Abstract

The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve the performance of these workloads on conventional processors has remained elusive.

Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses to generate dependent-chain loads, then it could regain performance, provided it could stall on many at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorizing the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, achieving high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79× performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.
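To give a flavor of the reordering the abstract describes, the sketch below is our own software analogy, not code from the paper; Vector Runahead performs the equivalent regrouping in hardware on the speculative runahead instruction stream, with no changes to the program. Scalar indirect loads from 16 consecutive loop iterations are regrouped into AVX-512 vector operations, so that 16 independent cache misses are in flight at each level of the dependent chain:

    #include <immintrin.h>  /* AVX-512F intrinsics; compile with -mavx512f */

    /* Scalar form: each iteration is a chain of two dependent loads,
     * so few misses overlap. Standard runahead skips past the miss on
     * idx[i] and never prefetches the dependent val[...] load. */
    int scalar_chain(const int *idx, const int *val, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += val[idx[i]];
        return sum;
    }

    /* Vectorized analogy: 16 iterations are regrouped so the 16
     * independent idx[] loads issue as one contiguous vector load and
     * the 16 dependent val[] loads as a single gather, putting 16
     * misses in flight at each level of the chain. Assumes n is a
     * multiple of 16 for brevity. */
    int vectorized_chain(const int *idx, const int *val, int n) {
        __m512i acc = _mm512_setzero_si512();
        for (int i = 0; i < n; i += 16) {
            __m512i vidx = _mm512_loadu_si512(&idx[i]);
            __m512i vval = _mm512_i32gather_epi32(vidx, val, 4);
            acc = _mm512_add_epi32(acc, vval);
        }
        return _mm512_reduce_add_epi32(acc);
    }

The analogy also shows where the bandwidth gain comes from: one gather occupies a single instruction slot yet launches 16 independent loads, which is how vectorizing the runahead stream raises effective fetch/decode bandwidth.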

Highlights

  • Modern-day workloads are poorly served by current out-of-order superscalar cores

  • We present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once

  • We evaluate Vector Runahead through detailed simulation using a variety of graph, database and high-performance computing (HPC) workloads, and we report that Vector Runahead improves performance by 1.79× compared to a baseline out-of-order processor, a significant improvement over the state-of-the-art Precise Runahead Execution (PRE) technique [64], which achieves a speedup of 1.20×


Summary

INTRODUCTION

From databases [40] to graph workloads [49, 67] to HPC codes [11, 30], many workloads feature sparse, indirect memory accesses [9] characterized by high-latency cache misses that are unpredictable by today's stride prefetchers [1, 19]. For these workloads, even out-of-order superscalar processors spend the majority of their time stalled, since their ample reorder-buffer and issue-queue resources are still insufficient to capture the memory-level parallelism necessary to hide today's DRAM latencies. Still, this performance gap is not insurmountable. Vector Runahead addresses it without significantly impacting system complexity, adding only 1.3 KB of new state over our baseline.
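To make this access pattern concrete, the C sketch below is our own illustration (the array names are hypothetical, chosen to mimic a CSR-style graph traversal, and are not taken from the paper's representative code example). Each address is computed from the value returned by a previous load, so no regular stride ever appears in the address stream:

    #include <stddef.h>

    /* A chain of dependent, indirect loads, as in sparse/graph codes.
     * edge_idx[] is traversed sequentially and is easy to prefetch,
     * but neighbor[e] and weight[v] depend on prior load values, so
     * their addresses are invisible to a stride prefetcher, and the
     * second miss cannot even issue until the first returns. */
    long sum_neighbor_weights(const int *edge_idx, const int *neighbor,
                              const long *weight, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            int e = edge_idx[i];   /* sequential: covered by prefetchers */
            int v = neighbor[e];   /* indirect: likely a DRAM access     */
            sum += weight[v];      /* dependent indirect: a second miss  */
        }
        return sum;
    }

An out-of-order core can overlap only as many of these chains as its reorder buffer can hold in flight, which is why even large instruction windows leave the core stalled.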

Memory Stalls in Out-of-Order Cores
Limitations of Runahead Techniques
Managing Pipeline Resources During Runahead
REPRESENTATIVE CODE EXAMPLE
EVALUATION
Performance and Sensitivity Analysis
Vector Runahead Effectiveness
Auto-vectorization
Runahead Execution
Pre-Execution and Helper Threads
Architecturally Visible Prefetching
CONCLUSION
