Abstract

The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large-scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they significantly lower performance. In this paper we evaluate the effectiveness of nonbinding software-controlled prefetching, as proposed in the Stanford DASH multiprocessor, to address this problem. The prefetches are nonbinding in the sense that the prefetched data is brought to a cache close to the processor, but is still available to the cache-coherence protocol to keep it consistent. Prefetching is software-controlled since the program must explicitly issue prefetch instructions. The paper presents results from detailed simulation studies done in the context of the Stanford DASH multiprocessor. Our results show that for applications with regular data access patterns—we evaluate a particle-based simulator used in aeronautics and an LU-decomposition application—prefetching can be very effective. It was easy to augment the applications to do prefetching and their performance was increased by 100–150% when we prefetched directly into the processor's cache. However, for applications with complex data usage patterns, prefetching was less successful. After much effort, the performance of a distributed-time logic simulation application that made extensive use of pointers and linked lists could be increased by only 30%. The paper also evaluates the effects of various hardware optimizations such as separate prefetch issue buffers, prefetching with exclusive ownership, lockup-free caches, and weaker memory consistency models on the performance of prefetching.
