Measuring Microarchitectural Details of Multi- and Many-Core Memory Systems through Microbenchmarking

Zhenman Fang,Pen-Chung Yew,Sanyam Mehta,Antonia Zhai,Binyu Zang,James Greensky,Gautham Beeraka

doi:10.1145/2687356

Abstract

As multicore and many-core architectures evolve, their memory systems are becoming increasingly more complex. To bridge the latency and bandwidth gap between the processor and memory, they often use a mix of multilevel private/shared caches that are either blocking or nonblocking and are connected by high-speed network-on-chip. Moreover, they also incorporate hardware and software prefetching and simultaneous multithreading (SMT) to hide memory latency. On such multi- and many-core systems, to incorporate various memory optimization schemes using compiler optimizations and performance tuning techniques, it is crucial to have microarchitectural details of the target memory system. Unfortunately, such details are often unavailable from vendors, especially for newly released processors. In this article, we propose a novel microbenchmarking methodology based on short elapsed-time events (SETEs) to obtain comprehensive memory microarchitectural details in multi- and many-core processors. This approach requires detailed analysis of potential interfering factors that could affect the intended behavior of such memory systems. We lay out effective guidelines to control and mitigate those interfering factors. Taking the impact of SMT into consideration, our proposed methodology not only can measure traditional cache/memory latency and off-chip bandwidth but also can uncover the details of software and hardware prefetching units not attempted in previous studies. Using the newly released Intel Xeon Phi many-core processor (with in-order cores) as an example, we show how we can use a set of microbenchmarks to determine various microarchitectural features of its memory system (many are undocumented from vendors). To demonstrate the portability and validate the correctness of such a methodology, we use the well-documented Intel Sandy Bridge multicore processor (with out-of-order cores) as another example, where most data are available and can be validated. Moreover, to illustrate the usefulness of the measured data, we do a multistage coordinated data prefetching case study on both Xeon Phi and Sandy Bridge and show that by using the measured data, we can achieve 1.3X and 1.08X performance speedup, respectively, compared to the state-of-the-art Intel ICC compiler. We believe that these measurements also provide useful insights into memory optimization, analysis, and modeling of such multicore and many-core architectures.

Full Text