ANALYZING THE PERFORMANCE IMPACT OF PARALLEL LATENCY IN THE PIPELINE

Matthew Constant

doi:10.23860/thesis-constant-matthew-2020

Abstract

This work introduces a concept, termed overlap latency, that is shown to severely limit performance in several types of benchmarks. Overlap latency is completely removed only when branch mispredictions and cache misses are eliminated together, rather than improved in isolation. Because most current research investigates improvements to either branch prediction or cache behavior, but not both, the proposed techniques cannot unlock this extra performance gain. To demonstrate the concept, benchmarks are evaluated under four configurations: baseline, which uses current state-of-the-art branch prediction and cache prefetching; perfect-bp, which emulates perfect branch direction prediction; perfect-cache, which emulates a perfect L1 data cache; and perfect, which combines perfect-bp and perfect-cache. In addition, a detailed analysis of select benchmarks shows the cause of overlap latency and its effect on an out-of-order execution CPU. Benchmarks were found to have the potential for up to 229% additional IPC beyond what would be expected from the individual performance gains of branch prediction and cache improvements.
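The comparison between the combined configuration and the individual ones can be illustrated with a minimal sketch. The abstract does not state how the "expected" IPC is computed, so the additive model below is an assumption made for illustration. The function name and the IPC values in the usage lines are also hypothetical and are not results from the thesis.

```python
def overlap_gain(baseline, perfect_bp, perfect_cache, perfect):
    """Compare the measured 'perfect' IPC against an additive expectation.

    ASSUMPTION: the expected IPC is the baseline plus the two individual
    gains. The thesis may define the expectation differently.
    """
    bp_gain = perfect_bp - baseline          # gain from perfect branch prediction alone
    cache_gain = perfect_cache - baseline    # gain from a perfect L1 data cache alone
    expected = baseline + bp_gain + cache_gain
    # Fraction of additional IPC beyond the additive expectation.
    extra = (perfect - expected) / expected
    return expected, extra


# Hypothetical IPC values for the four configurations (illustrative only).
expected, extra = overlap_gain(
    baseline=1.0, perfect_bp=1.25, perfect_cache=1.5, perfect=3.5
)
print(f"expected IPC: {expected}, additional IPC: {extra:.0%}")
```

Under this reading, a positive second value means that removing both sources of latency together yields more than the sum of the individual improvements, which is the effect the abstract attributes to overlap latency.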
