Multi-stage coordinated prefetching for present-day processors

Sanyam Mehta, Pen-Chung Yew, Zhenman Fang, Antonia Zhai

doi:10.1145/2597652.2597660

Abstract

Data prefetching is an important technique for hiding memory latency. The latest microarchitectures provide support for both hardware and software prefetching. However, the architectural features supporting each are different. In addition, these features can vary from one architecture to another. As a result, the choice of the right prefetching strategy is non-trivial for both programmers and compiler writers. In this paper, we study different prefetching techniques in the context of the different architectural features that support prefetching on existing hardware platforms. These features include the size of the line fill buffer or the Miss Status Handling Registers servicing prefetch requests at each level of cache, the aggressiveness and effectiveness of the hardware prefetchers, the interaction between software prefetch requests and the hardware prefetcher, the nature of the instruction pipeline (in-order/out-of-order execution), etc. Our experiments with two widely different processors, a recent multi-core (SandyBridge) and a many-core (Xeon Phi) processor, show that these architectural features have a significant bearing on the prefetching choice in a given source program, so much so that the best prefetching technique on SandyBridge performs worst on Xeon Phi and vice versa. Based on our study of the interaction between the host architecture and prefetching, we find that coordinated multi-stage prefetching, which brings data closer to the core in stages, yields the best performance. On SandyBridge, the mid-level cache hardware prefetcher and L1 software prefetching coordinate to achieve this end, whereas on Xeon Phi, pure software prefetching proves adequate. We implement our algorithm in the ROSE source-to-source compiler framework. Experimental results demonstrate that coordinated prefetching achieves a speed-up (geometric mean over benchmarks from the SPEC suite) of 1.55X and 1.3X against the hardware prefetcher and the Intel compiler, respectively, on Xeon Phi. On SandyBridge, a speed-up of 1.08X is obtained over its effective hardware prefetcher.
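
The idea of bringing data closer to the core in stages can be illustrated with a small hand-written sketch in C. This is not the algorithm the paper implements in ROSE; it only shows the general shape of a pure software two-stage scheme, in which a far prefetch targets an outer cache level and a near prefetch targets L1. The function name `sum_staged` and the distances `L2_DIST` and `L1_DIST` are hypothetical and untuned. The sketch assumes GCC or Clang, whose `__builtin_prefetch` takes a locality hint; how that hint maps to a cache level is decided by the compiler and the target, so the L2/L1 labels below are an assumption.

```c
#include <stddef.h>

/* Hypothetical prefetch distances, in elements. Real values would have
 * to be tuned to the memory latency and loop body of a given platform. */
#define L2_DIST 64  /* far stage: request data well ahead of use        */
#define L1_DIST 8   /* near stage: request data shortly before use      */

/* Sums an array while issuing two prefetches per iteration:
 * a far one with a lower locality hint (assumed to target an outer
 * cache level) and a near one with the highest locality hint (assumed
 * to target L1). Prefetching once per cache line would be enough;
 * this sketch prefetches every element to keep the loop simple. */
double sum_staged(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + L2_DIST < n)
            __builtin_prefetch(&a[i + L2_DIST], 0, 2); /* read, moderate locality */
        if (i + L1_DIST < n)
            __builtin_prefetch(&a[i + L1_DIST], 0, 3); /* read, high locality */
        sum += a[i];
    }
    return sum;
}
```

Prefetches are hints only, so the function returns the same result with or without them; any benefit shows up purely in run time and depends on the architectural features discussed in the abstract.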
