Abstract

Several studies have shown that the performance of coherent caches depends on the relationship between the cache block size and the granularity of sharing and locality exhibited by the program. Large cache blocks exploit processor and spatial locality, but may cause unnecessary cache invalidations due to false sharing. Small cache blocks can reduce the number of cache invalidations, but increase the number of bus or network transactions required to load data into the cache. In this dissertation we use reference traces from a variety of parallel programs and detailed simulation of a scalable shared-memory multiprocessor to examine the effects of cache block size on the performance of coherent caches, and to quantify this impact with respect to network bandwidth and latency. Our results suggest that, regardless of the available bandwidth or latency, applications with good spatial locality favor long cache lines, and for these applications the relative benefit of longer cache lines increases with bandwidth and latency. For applications with poor spatial locality, the best cache line size is determined by the product of the network bandwidth and latency, and the performance penalty induced by long cache lines increases as this product decreases. We also found that the performance penalty of a mismatch between the cache block size and the sharing patterns exhibited by an application increases with latency, decreases with bandwidth, and can be substantial even on machines with infinite bandwidth. To reduce this penalty we propose a new cache organization that adjusts the size of data blocks dynamically according to recent reference patterns: blocks are split in two when false sharing occurs, and merged back together to exploit spatial locality. Simulations in which we varied both the network bandwidth and latency indicate that, for the suite of applications we consider, the adjustable block size cache organization outperforms every fixed block size alternative (including caches that prefetch multiple lines). In addition, the performance benefits of adjustable block size caches grow with the ratio of network latency to bandwidth. We conclude that adjusting the block size in response to reference behavior can significantly improve the performance of coherent caches, especially when there is variability in the granularity of sharing exhibited by applications.
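The abstract specifies the policy only at a high level: split a block in two when false sharing is detected, and merge halves back when references show spatial locality. The C fragment below is a minimal sketch of that split/merge idea, not the dissertation's actual mechanism; the function names, the hit-count threshold, the assumption that offsets are relative to a maximum-block-aligned base, and the convention that a split keeps the lower half are all illustrative assumptions.

/* Hypothetical sketch of a split/merge block-size policy.  All names
 * and thresholds are illustrative; the source states only that blocks
 * split in two on false sharing and merge to exploit spatial locality. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_BLOCK 64   /* bytes; assumed maximum (fully merged) size */
#define MIN_BLOCK 16   /* assumed minimum size after repeated splits */

typedef struct {
    int  size;          /* current block size in bytes */
    bool local_dirty;   /* this cache has written into the block */
    int  adjacent_hits; /* recent local references to the sibling half */
} block_t;

/* Remote write invalidates this block.  Offsets are relative to the
 * MAX_BLOCK-aligned base.  If the local and remote writes fall in
 * different halves (false sharing), halve the block so the two halves
 * can be cached independently; assume the lower half is retained. */
static void on_remote_invalidate(block_t *b, int local_off, int remote_off)
{
    int  half = b->size / 2;
    bool different_halves = (local_off / half) != (remote_off / half);
    if (b->local_dirty && different_halves && b->size > MIN_BLOCK) {
        b->size = half;          /* split in two */
        b->adjacent_hits = 0;
    }
}

/* Local reference.  References that fall in the sibling half of a
 * previously split block are evidence of spatial locality; after a
 * few of them (assumed threshold: 4), merge back. */
static void on_local_reference(block_t *b, int off, bool is_write)
{
    if (is_write)
        b->local_dirty = true;
    if (b->size < MAX_BLOCK && off >= b->size) {  /* sibling half */
        if (++b->adjacent_hits >= 4)
            b->size *= 2;                         /* merge */
    }
}

int main(void)
{
    block_t b = { MAX_BLOCK, false, 0 };
    on_local_reference(&b, 0, true);        /* local write, low half  */
    on_remote_invalidate(&b, 0, 48);        /* remote write, high half */
    printf("after false sharing: %d bytes\n", b.size);   /* 32 */
    for (int off = 32; off < 64; off += 8)
        on_local_reference(&b, off, false); /* walk the sibling half   */
    printf("after sequential walk: %d bytes\n", b.size); /* 64 */
    return 0;
}

In a real coherent cache this state would live per line in the coherence controller, and the merge threshold trades responsiveness to spatial locality against stability under interleaved sharing; the values above are chosen only to make the example run.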
