Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models

Parthasarathy Ranganathan, Vijay S. Pai, Sarita V. Adve

doi:10.1145/258492.258512

Abstract

This paper studies techniques to improve the performance of memory consistency models for shared-memory multiprocessors with ILP processors. The first part of this paper extends earlier work by studying the impact of current hardware optimizations to memory consistency implementations, hardware-controlled non-binding prefetching and speculative load execution, on the performance of the processor consistency (PC) memory model. We find that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC. Nevertheless, release consistency (RC) provides significant benefits over PC in some cases, because PC suffers from the negative effects of premature store prefetches and insufficient memory queue sizes. The second part of the paper proposes and evaluates a new technique, speculative retirement, to improve the performance of SC. Speculative retirement alleviates the impact of the store-to-load constraint of SC by allowing loads and subsequent instructions to speculatively commit or retire, even while a previous store is outstanding. Speculative retirement needs additional hardware support (in the form of a history buffer) to recover from possible consistency violations due to such speculative retires. With a 64-element history buffer, speculative retirement reduces the execution time gap between SC and PC to within 11% for all our applications on our base architecture; a significant, though reduced, gap still remains between SC and RC. The third part of our paper evaluates the interactions of the various techniques with larger instruction window sizes. When increasing instruction window size, initially, the previous best implementations of all models generally improve in performance due to increased load and store overlap.
With further increases, the performance of PC and RC stabilizes while that of SC often degrades (due to negative effects of previous optimizations), widening the gap between the models. At low base instruction window sizes, speculative retirement is sometimes outperformed by an equivalent increase in instruction window size (because the latter also provides load overlap). However, beyond the point where RC stabilizes, speculative retirement gives comparable or better benefit than an equivalent instruction window increase, with possibly less complexity.

This work is supported in part by the National Science Foundation under Grant No. CCR-9410457, CCR-9502500, and CDA-9502791, and the Texas Advanced Technology Program under Grant No. 003604016. Vijay S. Pai is also supported by a Fannie and John Hertz Foundation Fellowship.

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. SPAA 97 Newport, Rhode Island USA. Copyright 1997 ACM 0-89791-890-8/97/06 ...$3.50
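
As an illustration of the recovery idea behind speculative retirement, the following is a minimal software sketch of a history buffer, not the hardware design evaluated in the paper. The names `HistoryBuffer`, `Entry`, `retire_speculatively`, `stores_completed`, and `on_invalidation` are hypothetical. The sketch assumes that each entry logs the register value overwritten at retirement, that a violation is signalled by an external invalidation to an address read by a speculatively retired load, and that all entries are released together once the outstanding stores complete; the abstract does not specify these details.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    pc: int                    # address of the speculatively retired instruction
    dest: str                  # destination register it overwrote
    old_value: int             # register value before retirement (for undo)
    load_addr: Optional[int]   # memory address read, if the instruction was a load


class HistoryBuffer:
    """Toy model (assumption): logs speculative retires so they can be undone."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.entries = deque()

    def retire_speculatively(self, regs, pc, dest, new_value, load_addr=None):
        """Retire past an outstanding store; returns False if retirement must stall."""
        if len(self.entries) >= self.capacity:
            return False  # buffer full: no further speculative retirement
        self.entries.append(Entry(pc, dest, regs[dest], load_addr))
        regs[dest] = new_value
        return True

    def stores_completed(self):
        """Outstanding stores are done, so the logged retires are no longer speculative."""
        self.entries.clear()

    def on_invalidation(self, regs, addr):
        """Roll back to the oldest speculative load of addr; return its pc or None."""
        index = next(
            (i for i, e in enumerate(self.entries) if e.load_addr == addr), None
        )
        if index is None:
            return None  # no speculatively retired load read this address
        restart_pc = self.entries[index].pc
        # Undo youngest-first so each register ends at its pre-speculation value.
        while len(self.entries) > index:
            victim = self.entries.pop()
            regs[victim.dest] = victim.old_value
        return restart_pc
```

In this sketch, a full buffer forces retirement to stall, which suggests why the buffer size (64 elements in the reported result) bounds how far retirement can run ahead of an outstanding store.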

Full Text