Future Multi-core Architectures Research Articles

Shared memory is the most popular parallel programming model for multi-core processors, while message passing is generally used for large distributed machines. However, as the number of cores on a chip increases, the relative merits of shared memory versus message passing change, and we argue that message passing becomes a viable, high-performing parallel programming model. To test this hypothesis, we compare a shared memory architecture with a new message passing architecture on a suite of applications tuned for each system independently. Perhaps surprisingly, the fundamental behaviors of the applications studied in this work, when optimized for both models, are very similar, and both versions could execute efficiently on multicore architectures even though many of the implementations differ. Furthermore, if hardware is tuned for message passing by supporting bulk message transfer and eliminating unnecessary coherence overheads, and if effective support is available for global operations, then some applications would perform much better on a message passing architecture. Leveraging these insights, we design a message passing architecture that supports both memory-to-memory and cache-to-cache messaging in hardware. With the new architecture, message passing outperforms its shared memory counterpart on many of the applications because of the advantages of the message passing hardware over cache coherence. In the best case, message passing achieves up to a 34% increase in speed over its shared memory counterpart, and the average increase is 10%. In the worst case, message passing is slower in two applications, CG (conjugate gradient) and FT (Fourier transform), because it handles their unique data sharing patterns less well than shared memory does. Overall, our analysis demonstrates the importance of considering message passing as a high-performing, hardware-supported programming model on future multicore architectures.
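To make the two programming models concrete, here is a minimal sketch of the same reduction written both ways. It is an illustration only, not the architecture or the benchmarks from the abstract: Python threads stand in for cores, and the function names `shared_memory_sum` and `message_passing_sum` are hypothetical.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Shared memory style: workers update one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # synchronization protects the shared state
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Message passing style: workers share no state and send their results."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # explicit send of the partial result

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # One explicit receive per worker; this is the "global operation" step.
    return sum(mailbox.get() for _ in chunks)


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    print(shared_memory_sum(data), message_passing_sum(data))
```

In the first version the communication is implicit in reads and writes to `total`; in the second it is explicit in the send and receive calls, which is the kind of traffic a hardware messaging mechanism could move in bulk.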


Growing processing demand on multitasking real-time systems can be met by employing scalable multicore architectures. For such environments, locking cache lines for hard real-time systems ensures timing predictability of data references and may lower worst-case execution time. This work studies the benefits of cache locking on massive multicore architectures with private caches in the context of hard real-time systems. In shared cache architectures, the cache is a single resource shared among all of the tasks. However, in scalable cache architectures with private caches, conflicts exist only among the tasks scheduled on one core. This calls for a cache-aware allocation of tasks onto cores. The objective of this work is to increase the predictability of memory accesses resolved by caches while reducing the number of cores for a given task set. This allows designers to reduce the footprint of their subsystem of real-time tasks, and thereby the cost, either by choosing a target product with fewer cores or by co-locating more subsystems on a given fixed number of cores. Our work proposes a novel variant of the cache-unaware First Fit Decreasing (FFD) algorithm, the Naive locked First Fit Decreasing (NFFD) policy. We propose two cache-aware static scheduling schemes, (a) Greedy First Fit Decreasing (GFFD) and (b) Colored First Fit Decreasing (CoFFD), for task sets where tasks do not have intratask conflicts among locked regions (Scenario A). NFFD is capable of scheduling high utilization task sets that FFD cannot schedule. Experiments also show that CoFFD consistently outperforms GFFD, resulting in a lower number of cores and lower system utilization. CoFFD reduces the number of cores required by 30% to 60% compared to NFFD. For a more generic case where tasks have intratask conflicts, we split task partitioning into two phases: task selection and task allocation (Scenario B). Instead of resolving conflicts at a global level, these algorithms resolve conflicts among regions while allocating a task onto a core, and they unlock at region level instead of task level. We show that combining dynamic ordering (task selection) with Chaitin’s Coloring (task allocation) reduces the number of cores required by up to 22% over a basic scheme (a combination of monotone ordering and regional FFD). Regional unlocking allows this scheme to outperform CoFFD for medium utilization task sets from Scenario A. However, CoFFD performs better than any other scheme for high utilization task sets from Scenario A. Overall, this work is unique in considering the challenges of future multicore architectures for real-time systems and provides key insights into task partitioning and cache-locking mechanisms for architectures with private caches.
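The cache-unaware First Fit Decreasing baseline mentioned above can be sketched as a bin-packing routine. This is a generic textbook version, not the NFFD, GFFD, or CoFFD policies from the abstract; the per-core utilization bound of 1.0 and the name `first_fit_decreasing` are assumptions made for illustration.

```python
def first_fit_decreasing(utilizations, capacity=1.0):
    """Partition tasks onto cores with cache-unaware First Fit Decreasing.

    utilizations: per-task utilization values (assumed to lie in (0, capacity]).
    capacity: assumed per-core utilization bound (1.0 here, for illustration).
    Returns a list of cores, each a list of task indices.
    """
    eps = 1e-9  # tolerance for floating-point rounding
    cores = []  # task indices assigned to each core
    loads = []  # running utilization of each core

    # Consider tasks from highest to lowest utilization.
    order = sorted(range(len(utilizations)),
                   key=lambda i: utilizations[i], reverse=True)

    for i in order:
        u = utilizations[i]
        if u > capacity + eps:
            raise ValueError(f"task {i} does not fit on any core")
        # Place the task on the first core that still has room.
        for c, load in enumerate(loads):
            if load + u <= capacity + eps:
                cores[c].append(i)
                loads[c] += u
                break
        else:
            # No existing core fits, so open a new one.
            cores.append([i])
            loads.append(u)
    return cores


if __name__ == "__main__":
    print(first_fit_decreasing([0.6, 0.5, 0.4, 0.3, 0.2]))
    # -> [[0, 2], [1, 3, 4]]
```

A cache-aware variant would add a second admission test before placing a task on a core, for example checking that its locked cache regions do not conflict with those of tasks already there. The abstract does not specify that test, so it is left out of this sketch.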


Articles published on Future Multi-core Architectures

Effective On-Chip Communication for Message Passing Programs on Multi-Core Processors

Exploiting Shared-Memory to Steer Scalability of Fault Simulation Using Multicore Systems

Modeling the Power Variability of Core Speed Scaling on Homogeneous Multicore Systems

Static Task Partitioning for Locked Caches in Multicore Real-Time Systems

Guest Editors' Introduction: Silicon Nanophotonics for Future Multicore Architectures

Silicon Nanophotonics for Future Multicore Architectures: Opportunities and Challenges

Three-Dimensional Stacked Nanophotonic Network-on-Chip Architecture with Minimal Reconfiguration

Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs

Scalable multicore architectures for long DNA sequence comparison

Towards modeling & analysis of consolidated CMP servers
