Abstract

Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting solution is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before it is required. However, such prefetches are difficult to insert effectively, and techniques for automatic insertion are currently limited. This article develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent, and the class of programs that are particularly amenable. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3× for an Intel Haswell processor, 1.1× for both an ARM Cortex-A57 and a Qualcomm Kryo, 1.2× for an ARM Cortex-A72 and an Intel Kaby Lake, and 1.35× for an Intel Xeon Phi Knights Landing, each of which is an out-of-order core, and performance improvements of 2.1× and 2.7× for the in-order ARM Cortex-A53 and first-generation Intel Xeon Phi.

Highlights

  • Many modern workloads for high-performance compute (HPC) and data processing are heavily memory-latency bound [10, 13, 18, 25]

  • Hardware prefetching techniques do not work for irregular access patterns, as seen in linked data structures and in indirect memory accesses, where the addresses loaded are based on indices stored in arrays

  • We evaluate the factors that affect software prefetching in different systems


Introduction

Many modern workloads for high-performance compute (HPC) and data processing are heavily memory-latency bound [10, 13, 18, 25]. The traditional solution to this has been prefetching: using hardware to detect common access patterns such as strides [4, 28], and bring the required data into fast cache memory before it is requested by the processor. These techniques do not work for irregular access patterns, as seen in linked data structures and in indirect memory accesses, where the addresses loaded are based on indices stored in arrays. Prefetching too far ahead risks cache pollution and the data being evicted before use; prefetching too late risks the data not being fetched early enough to mask the cache miss. These factors can often cause software prefetches to under-perform, or show no benefit, even in seemingly ideal situations.
