Prefetching is one of several techniques for hiding and tolerating the large memory latencies of scalable multiprocessors. In this paper, we present a performance model for analyzing the limits and effectiveness of data prefetching. The model incorporates the effects of program behavior, network characteristics, cache coherence protocols, and the memory consistency model. Our results indicate that, as long as there is enough spare network bandwidth, prefetching is very effective in hiding large latencies. In machines whose caches are large enough to hold the program working set, intra- and internode cache interference is too low to have any significant impact on prefetching performance. Furthermore, we show that the effective prefetch distance plays a vital role and adapts well to changes in cache miss rates and remote latencies, which makes prefetches more effective in hiding latency. We provide an adaptive algorithm that optimizes the prefetch distance based on the dynamic behavior of the application, the interconnection network, and the distributed caches and memories. This ability to optimize the prefetch distance is a significant advantage of prefetching over other latency-tolerating techniques, such as multithreading. We show that the prefetch distance can be a constant, can be program-dependent, or can be determined from performance information. The optimal distance can be determined adaptively using both compile-time and runtime conditions. Our results are therefore useful not only to compiler writers but also to developers of runtime support systems for multiprocessors. In large-scale systems, in which network traffic control dominates performance, the ultimate goal is to match program behavior with machine behavior.
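To make the notion of an adaptive prefetch distance concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes the commonly used rule that the distance should be roughly the memory latency divided by the time of one loop iteration, rounded up, and it tracks latency with an exponential moving average. The names `prefetch_distance` and `AdaptiveDistance`, the smoothing factor `alpha`, and the cap `max_distance` are all hypothetical and chosen for illustration only.

```python
import math


def prefetch_distance(latency_cycles: float,
                      cycles_per_iteration: float,
                      max_distance: int = 64) -> int:
    """Iterations to prefetch ahead: ceil(latency / iteration time).

    Assumed rule of thumb, not the paper's model. The result is clamped
    to [1, max_distance]; the cap is an illustrative stand-in for limits
    such as cache capacity or network bandwidth.
    """
    if latency_cycles < 0 or cycles_per_iteration <= 0:
        raise ValueError("latency must be >= 0 and iteration time > 0")
    distance = math.ceil(latency_cycles / cycles_per_iteration)
    return max(1, min(distance, max_distance))


class AdaptiveDistance:
    """Recomputes the distance as observed remote latency changes.

    Hypothetical runtime helper: smooths latency samples with an
    exponential moving average so one outlier does not swing the distance.
    """

    def __init__(self, cycles_per_iteration: float,
                 alpha: float = 0.25,
                 initial_latency: float = 100.0) -> None:
        self.cycles_per_iteration = cycles_per_iteration
        self.alpha = alpha
        self.latency = initial_latency

    def observe(self, measured_latency: float) -> int:
        # Blend the new sample into the running latency estimate.
        self.latency = ((1.0 - self.alpha) * self.latency
                        + self.alpha * measured_latency)
        return prefetch_distance(self.latency, self.cycles_per_iteration)
```

In this sketch, a compiler could call `prefetch_distance` once with static estimates to obtain a constant distance, while a runtime system could feed measured latencies to `AdaptiveDistance.observe` and adjust the distance as network load changes.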