Abstract

The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve the performance of these workloads on conventional processors has remained elusive.

Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses to generate dependent-chain loads, then it could regain performance, provided it could stall on many at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorizing the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, achieving high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79× performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.
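To give a flavor of the reordering the abstract describes, the sketch below is our own software analogy, not code from the paper; Vector Runahead performs the equivalent regrouping in hardware on the speculative runahead instruction stream, with no changes to the program. Scalar indirect loads from 16 consecutive loop iterations are regrouped into AVX-512 vector operations, so that 16 independent cache misses are in flight at each level of the dependent chain:

    #include <immintrin.h>  /* AVX-512F intrinsics; compile with -mavx512f */

    /* Scalar form: each iteration is a chain of two dependent loads,
     * so few misses overlap. Standard runahead skips past the miss on
     * idx[i] and never prefetches the dependent val[...] load. */
    int scalar_chain(const int *idx, const int *val, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += val[idx[i]];
        return sum;
    }

    /* Vectorized analogy: 16 iterations are regrouped so the 16
     * independent idx[] loads issue as one contiguous vector load and
     * the 16 dependent val[] loads as a single gather, putting 16
     * misses in flight at each level of the chain. Assumes n is a
     * multiple of 16 for brevity. */
    int vectorized_chain(const int *idx, const int *val, int n) {
        __m512i acc = _mm512_setzero_si512();
        for (int i = 0; i < n; i += 16) {
            __m512i vidx = _mm512_loadu_si512(&idx[i]);
            __m512i vval = _mm512_i32gather_epi32(vidx, val, 4);
            acc = _mm512_add_epi32(acc, vval);
        }
        return _mm512_reduce_add_epi32(acc);
    }

The analogy also shows where the bandwidth gain comes from: one gather occupies a single instruction slot yet launches 16 independent loads, which is how vectorizing the runahead stream raises effective fetch/decode bandwidth.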

Highlights

  • Modern-day workloads are poorly served by current out-of-order superscalar cores

  • We present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once

  • We evaluate Vector Runahead through detailed simulation using a variety of graph, database and high-performance computing (HPC) workloads, and we report that Vector Runahead improves performance by 1.79× compared to a baseline out-of-order processor, a significant improvement over the state-of-the-art Precise Runahead Execution (PRE) technique [64], which achieves a speedup of 1.20×


Summary

INTRODUCTION

From databases [40] to graph workloads [49, 67] to HPC codes [11, 30], many workloads feature sparse, indirect memory accesses [9] characterized by high-latency cache misses that are unpredictable by today's stride prefetchers [1, 19]. For these workloads, even out-of-order superscalar processors spend the majority of their time stalled, since their ample reorder-buffer and issue-queue resources are still insufficient to capture the memory-level parallelism necessary to hide today's DRAM latencies. Still, this performance gap is not insurmountable. Vector Runahead addresses it without significantly impacting system complexity, adding only 1.3 KB of new state over our baseline.
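To make this access pattern concrete, the C sketch below is our own illustration (the array names are hypothetical, chosen to mimic a CSR-style graph traversal, and are not taken from the paper's representative code example). Each address is computed from the value returned by a previous load, so no regular stride ever appears in the address stream:

    #include <stddef.h>

    /* A chain of dependent, indirect loads, as in sparse/graph codes.
     * edge_idx[] is traversed sequentially and is easy to prefetch,
     * but neighbor[e] and weight[v] depend on prior load values, so
     * their addresses are invisible to a stride prefetcher, and the
     * second miss cannot even issue until the first returns. */
    long sum_neighbor_weights(const int *edge_idx, const int *neighbor,
                              const long *weight, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            int e = edge_idx[i];   /* sequential: covered by prefetchers */
            int v = neighbor[e];   /* indirect: likely a DRAM access     */
            sum += weight[v];      /* dependent indirect: a second miss  */
        }
        return sum;
    }

An out-of-order core can overlap only as many of these chains as its reorder buffer can hold in flight, which is why even large instruction windows leave the core stalled.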

Memory Stalls in Out-of-Order Cores
Limitations of Runahead Techniques
Managing Pipeline Resources During Runahead
REPRESENTATIVE CODE EXAMPLE
EVALUATION
Performance and Sensitivity Analysis
Vector Runahead Effectiveness
Auto-vectorization
Runahead Execution
Pre-Execution and Helper Threads
Architecturally Visible Prefetching
CONCLUSION
