Cache-coherent Non-uniform Memory Access Research Articles

Parallel programs running on shared-memory multiprocessors coordinate via shared data objects and structures. To keep these structures consistent, programs typically rely on some form of software synchronisation. Unfortunately, typical software synchronisation mechanisms usually perform poorly because they generate large amounts of memory and interconnection-network contention and, more significantly, because they produce convoy effects whose cost grows significantly in multiprogramming environments: if a process holding a lock is preempted, processes on other processors waiting for that lock cannot proceed. Researchers have introduced non-blocking synchronisation to address these problems. Non-blocking implementations allow multiple tasks to access a shared object at the same time without enforcing mutual exclusion. However, the performance implications of non-blocking synchronisation are not well understood on modern systems or on real applications. In this paper we study the impact of non-blocking synchronisation on parallel applications running on a modern, 64-processor, cache-coherent, shared-memory multiprocessor system: the SGI Origin 2000. Cache-coherent non-uniform memory access (ccNUMA) shared-memory multiprocessor systems have attracted considerable research and commercial interest in recent years. In addition to reporting performance results on a modern system, we investigate the key synchronisation schemes used in multiprocessor applications and their efficient transformation to non-blocking ones.

Evaluating the impact of synchronisation performance on applications is important for several reasons. First, micro-benchmarks cannot capture every aspect of primitive performance, and it is hard to predict the impact of a primitive on application performance. For example, a lock or barrier that generates a lot of additional network traffic might have little impact on applications. Second, even in applications that spend significant time in synchronisation operations, the synchronisation time might be dominated by wait time due to load imbalance and lock serialisation in the application, which a better synchronisation implementation may not reduce. Third, micro-benchmarks rarely generate the scenarios that occur in real applications.

We evaluated the benefits of non-blocking synchronisation in a range of applications running on a modern realization of a shared-memory multiprocessor, a 64-processor SGI Origin 2000. In this evaluation, i) we used a large set of applications with different communication characteristics, making sure to include applications that do not spend much time in synchronisation, and ii) we modified all the lock-based synchronisation points of these applications where possible. The goal of our work was to provide an in-depth understanding of how non-blocking synchronisation can improve the performance of modern parallel applications. More specifically, the main issues addressed in this paper include:
i) The architectural implications of ccNUMA for the design of non-blocking synchronisation.
ii) The identification of the basic locking operations that parallel programmers use in their applications.
iii) The efficient non-blocking implementation of these synchronisation operations.
iv) The experimental comparison of the lock-based and lock-free versions of the respective applications on a cache-coherent non-uniform memory access shared-memory multiprocessor system.
v) The identification of the structural differences between applications that benefit more from non-blocking synchronisation and those that benefit less.

We chose to examine these issues on a 64-processor SGI Origin 2000 multiprocessor system. This machine is attractive for the study because it provides an aggressive communication architecture and supports both in-cache and at-memory synchronisation primitives. The conclusions and methods presented in this paper nevertheless apply generally to other realizations of cache-coherent non-uniform memory access machines. Our results can benefit parallel programmers in two ways: first, by helping them understand the benefits of non-blocking synchronisation, and second, by helping them transform typical lock-based synchronisation operations that are probably used in their programs into non-blocking ones, using the general translations that we provide in this paper.
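As a rough illustration of what replacing a lock-based operation with a non-blocking one can look like, the sketch below shows a shared counter updated first under a mutex and then with a compare-and-swap retry loop. It is not taken from the paper: the names `LockedCounter`, `LockFreeCounter` and `run_workers` are hypothetical, and portable C++ `std::atomic` stands in for whatever primitives the authors used on the SGI Origin 2000.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Lock-based version: if the thread holding the mutex is preempted,
// every other thread calling add() has to wait for it.
struct LockedCounter {
    std::mutex m;
    long value = 0;

    void add(long delta) {
        std::lock_guard<std::mutex> guard(m);
        value += delta;
    }
    long get() {
        std::lock_guard<std::mutex> guard(m);
        return value;
    }
};

// Non-blocking version (assumed translation, for illustration only):
// read the current value, compute the new one, and publish it with
// compare-and-swap, retrying if another thread updated it in between.
struct LockFreeCounter {
    std::atomic<long> value{0};

    void add(long delta) {
        long seen = value.load();
        // On failure, compare_exchange_weak stores the current value
        // into `seen`, so the next attempt uses fresh data.
        while (!value.compare_exchange_weak(seen, seen + delta)) {
        }
    }
    long get() const { return value.load(); }
};

// Hypothetical driver: start `threads` workers that each add 1 `iters`
// times, wait for all of them, and return the final counter value.
template <typename Counter>
long run_workers(Counter& counter, int threads, int iters) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i) counter.add(1);
        });
    }
    for (auto& worker : pool) worker.join();
    return counter.get();
}
```

Calling `run_workers` on either counter with the same thread and iteration counts should give the same total. The difference is that in the lock-free version a preempted thread never prevents the others from completing their updates, which is the convoy effect the abstract describes.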

Commercial cache-coherent nonuniform memory access (ccNUMA) systems often require extensive investments in hardware design and operating system support. A different approach to building these systems is to use Standard High Volume (SHV) hardware and stock software components as building blocks and assemble them with minimal investments in hardware and software. This design approach trades the performance advantages of specialized hardware design for simplicity and implementation speed, and relies on application-level tuning for scalability and performance. We present our experience with this approach in this paper. We built a 16-way ccNUMA Intel system consisting of four commodity four-processor Fujitsu® Teamserver™ SMPs connected by a Synfinity™ cache-coherent switch. The system features a total of sixteen 350-MHz Intel® Xeon™ processors and 4 GB of physical memory, and runs the standard commercial Microsoft Windows NT® operating system. The system can be partitioned statically or dynamically, and uses an innovative, combined hardware/software approach to support application-level performance tuning. On the hardware side, a programmable performance-monitor card measures the frequency of remote-memory accesses, which constitute the predominant source of performance overhead. The monitor does not cause any performance overhead and can be deployed in production mode, providing the possibility for dynamic performance tuning if the application workload changes over time. On the software side, the Resource Set abstraction allows application-level threads to improve performance and scalability by specifying their execution and memory affinity across the ccNUMA system. Results from a performance-evaluation study confirm the success of the combined hardware/software approach for performance tuning in computation-intensive workloads. 
The results also show that poor local-memory bandwidth in commodity Intel-based systems, rather than the latency of remote-memory access, is often the main contributor to poor scalability and performance. The contributions of this work can be summarized as follows:
• The Resource Set abstraction allows control over resource allocation in a portable manner across ccNUMA architectures; we describe how it was implemented without modifying the operating system.
• A programmable performance-monitor card, designed specifically for a ccNUMA environment, allows dynamic, adaptive performance optimizations.
• A performance study shows that, in an Intel-based architecture, performance and scalability are often limited by local-memory bandwidth rather than by the effects of remote-memory access.

Articles published on Cache-coherent Non-uniform Memory Access

A highly efficient 3D level-set grain growth algorithm tailored for ccNUMA architecture

NumaGiC

NumaGiC

NUMA-Aware Multicore Matrix Multiplication

CC-NUMA Oriented Conflict Preventing Method for Transactional Memory

Moving address translation closer to memory in distributed shared-memory multiprocessors

Shared memory multiprocessor architectures for software ip routers

Optimizing operating system performance for CC‐NUMA architectures

Design and analysis of static memory management policies for CC-NUMA multiprocessors

Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM

Evaluating the performance of non-blocking synchronization on shared-memory multiprocessors

Experience with building a commodity Intel-based ccNUMA system

Architecture and design of AlphaServer GS320

Architecture and design of AlphaServer GS320

Utilization of cache area in on-chip multiprocessor

Architecture and design of AlphaServer GS320

Impact of CC-NUMA memory management policies on the application performance of multistage switching networks

Parallelization of a dynamic unstructured algorithm using three leading programming paradigms

Design and evaluation of a switch cache architecture for CC-NUMA multiprocessors

Performance evaluation and cost analysis of cache protocol extensions for shared-memory multiprocessors
