Abstract
Tiled many-core processors integrate many simple cores onto a single chip to exploit software-level parallelism, and these cores are interconnected via mesh-based networks to mitigate overheads, such as limited throughput, of traditional interconnects. As these processors become more prevalent, an overlooked problem is that operating system (OS) designers are increasingly likely to assume that these processors, which have multiple on-chip memory controllers, are non-uniform memory access (NUMA) systems. In this paper, we define novel models that differentiate uniform memory access (UMA) from NUMA on tiled many-core processors from the perspective of the cache system, to help OS designers and application programmers fully understand the underlying hardware. Whether a tiled many-core processor is a NUMA system is determined by its cache system rather than by how many memory controllers it has. The experimental results, together with the novel models, explain the presence or absence of significant performance differences observed on KNL and TILE-Gx72.
Highlights
KNL (Knights Landing) [1] from Intel, and the TILE-Gx series of processors, including TILE-Gx36 [2] and TILE-Gx72 [3] from Mellanox Technologies, have recently emerged in the market as real tiled many-core processors.
We define novel models (UMAcache and NUMAcache) that are based on the cache coherence protocol on tiled many-core processors.
A system matches the UMAcache model when each physical page is distributed across all available on-chip tiles for the purpose of maintaining cache coherence, whereas it matches the NUMAcache model when a physical page is distributed across only a portion of the on-chip tiles.
Summary
KNL (Knights Landing) [1] from Intel, and the TILE-Gx series of processors, including TILE-Gx36 [2] and TILE-Gx72 [3] from Mellanox Technologies, have recently emerged in the market as real tiled many-core processors. KNL can be viewed as either a UMAcache or a NUMAcache system depending on its hardware configuration (the all-to-all versus the SNC2/4 cluster modes), and relatively better program performance under UMAcache can be anticipated when (1) on-chip network congestion is not a problem, (2) the program does not stress the cache system aggressively, and (3) the main overhead comes from the memory system. Contributions: The most significant contribution of this paper is the definition of the novel UMAcache and NUMAcache models for tiled many-core processors. These models are based on the cache access latency between a requester tile and the home tile (plus the owner tile for KNL-like processors), rather than on the memory access latency used by the conventional UMA and NUMA models.