Data Prefetching Research Articles

Prefetching is one of several techniques for hiding and tolerating the large memory latencies of scalable multiprocessors. In this paper, we present a performance model for analyzing the limits and effectiveness of data prefetching. The model incorporates the effects of program behavior, network characteristics, cache coherency protocols, and memory consistency models. Our results indicate that, as long as there is enough extra network bandwidth, prefetching is very effective in hiding large latencies. In machines with caches large enough to hold the program working set, intra- and internode cache interference is too low to have any significant impact on prefetching performance. Furthermore, we show that the effective prefetch distance plays a vital role and adapts extremely well to changes in cache miss rates and remote latencies, thus allowing prefetches to be more effective in hiding latency. An adaptive algorithm is provided to optimize the prefetch distance based on the dynamic behavior of the application, interconnection network, and distributed caches and memories. This optimization of the prefetch distance constitutes a significant advantage of prefetching over other latency-tolerating techniques, such as multithreading. We show that the prefetch distance can be chosen as a constant, made program-dependent, or decided from performance information. The optimal distance could be adaptively determined using both compile-time and runtime conditions. Our results are therefore useful not only for compiler writers but also for the development of runtime support systems in multiprocessors. In large-scale systems, in which network traffic control dominates performance, the ultimate goal is to match program behavior with machine behavior.
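The abstract does not spell out its adaptive algorithm, so the following is only a rough sketch of what choosing a prefetch distance involves. It uses a common rule of thumb rather than the paper's method: prefetch far enough ahead that the number of loop iterations in flight covers the memory latency. The function name `prefetch_distance` and both parameters are hypothetical, and the latency and per-iteration cost are assumed to come from measurement, for example performance counters.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from the paper): number of loop iterations
 * to prefetch ahead so that data requested now arrives before it is used.
 *
 *   latency_cycles  - assumed measured memory (or remote) latency
 *   cycles_per_iter - assumed measured cost of one loop iteration
 *
 * Returns ceil(latency_cycles / cycles_per_iter), with a minimum of 1.
 */
size_t prefetch_distance(size_t latency_cycles, size_t cycles_per_iter)
{
    if (cycles_per_iter == 0) {
        return 1; /* no usable measurement: fall back to one iteration */
    }
    size_t d = (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
    return d == 0 ? 1 : d;
}
```

Calling this once at compile time with fixed estimates gives a constant distance; calling it again at run time whenever the measured latency changes gives a simple adaptive scheme in the spirit of the abstract.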

The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large-scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they significantly lower performance. In this paper we evaluate the effectiveness of nonbinding software-controlled prefetching, as proposed in the Stanford DASH multiprocessor, to address this problem. The prefetches are nonbinding in the sense that the prefetched data is brought to a cache close to the processor, but is still available to the cache-coherence protocol to keep it consistent. Prefetching is software-controlled since the program must explicitly issue prefetch instructions. The paper presents results from detailed simulation studies done in the context of the Stanford DASH multiprocessor. Our results show that for applications with regular data access patterns—we evaluate a particle-based simulator used in aeronautics and an LU-decomposition application—prefetching can be very effective. It was easy to augment the applications to do prefetching and their performance was increased by 100–150% when we prefetched directly into the processor's cache. However, for applications with complex data usage patterns, prefetching was less successful. After much effort, the performance of a distributed-time logic simulation application that made extensive use of pointers and linked lists could be increased by only 30%. The paper also evaluates the effects of various hardware optimizations such as separate prefetch issue buffers, prefetching with exclusive ownership, lockup-free caches, and weaker memory consistency models on the performance of prefetching.
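As an illustration of software-controlled, nonbinding prefetching on an application with a regular access pattern, the sketch below sums an array while issuing prefetch hints a fixed number of elements ahead. It uses the GCC/Clang builtin `__builtin_prefetch`, not the DASH prefetch instruction, and the function name `sum_with_prefetch` and the value of `PREFETCH_DISTANCE` are assumptions chosen for the example. The hint only moves data toward the cache; the later load still goes through the normal memory system, so the result is the same with or without the prefetch.

```c
#include <stddef.h>

/* Assumed value for illustration; a real program would tune this. */
#define PREFETCH_DISTANCE 16

/*
 * Sum n doubles, hinting that a[i + PREFETCH_DISTANCE] will be read soon.
 * __builtin_prefetch(addr, rw, locality): rw = 0 means prefetch for read,
 * locality = 3 asks to keep the line in all cache levels.
 */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

A pointer-chasing structure such as a linked list cannot be handled this way, because the address of the element several steps ahead is not known until the intermediate nodes have been loaded, which is consistent with the smaller gain the abstract reports for the logic simulation application.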

Articles published on Data Prefetching

Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors

Effective hardware-based data prefetching for high-performance processors

Dissemination-based data delivery using broadcast disks

Performance and Optimization of Data Prefetching Strategies in Scalable Multiprocessors

Predicting and precluding problems with memory latency

Evaluating stream buffers as a secondary cache replacement

A load-instruction unit for pipelined processors

Concurrency control for high contention environments

Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
