Evaluating the performance of non-blocking synchronization on shared-memory multiprocessors

Philippas Tsigas, Yi Zhang

doi:10.1145/378420.378810

Abstract

Parallel programs running on shared-memory multiprocessors coordinate via shared data objects and structures. To ensure the consistency of these shared data structures, programs typically rely on some form of software synchronisation. Unfortunately, typical software synchronisation mechanisms usually result in poor performance because they produce large amounts of memory and interconnection network contention and, more significantly, because they produce convoy effects that become significantly worse in multiprogramming environments: if one process holding a lock is preempted, other processes on different processors waiting for the lock will not be able to proceed. Researchers have introduced non-blocking synchronisation to address these problems. Non-blocking implementations allow multiple tasks to access a shared object at the same time without enforcing mutual exclusion. However, the performance implications of non-blocking synchronisation are not well understood on modern systems or on real applications. In this paper we study the impact of non-blocking synchronisation on parallel applications running on a modern, 64-processor, cache-coherent, shared-memory multiprocessor system: the SGI Origin 2000. Cache-coherent non-uniform memory access (ccNUMA) shared-memory multiprocessor systems have attracted considerable research and commercial interest in recent years. In addition to the performance results on a modern system, we also investigate the key synchronisation schemes that are used in multiprocessor applications and their efficient transformation to non-blocking ones. Evaluating the impact of synchronisation performance on applications is important for several reasons. First, micro-benchmarks cannot capture every aspect of primitive performance, and it is hard to predict the impact of a primitive on application performance. For example, a lock or barrier that generates a lot of additional network traffic might have little impact on applications.
Second, even in applications that spend significant time in synchronisation operations, the synchronisation time might be dominated by wait time due to load imbalance and lock serialisation in the application, which better synchronisation implementations may not help to reduce. Third, micro-benchmarks rarely generate scenarios that occur in real applications. We evaluated the benefits of non-blocking synchronisation in a range of applications running on a modern realization of a shared-memory multiprocessor, a 64-processor SGI Origin 2000. In this evaluation, i) we used a large set of applications with different communication characteristics, making sure to also include applications that do not spend a lot of time in synchronisation, and ii) we modified all the lock-based synchronisation points of these applications where possible. The goal of our work was to provide an in-depth understanding of how non-blocking synchronisation can improve the performance of modern parallel applications. More specifically, the main issues addressed in this paper include: i) the architectural implications of ccNUMA for the design of non-blocking synchronisation; ii) the identification of the basic locking operations that parallel programmers use in their applications; iii) the efficient non-blocking implementation of these synchronisation operations; iv) the experimental comparison of the lock-based and lock-free versions of the respective applications on a cache-coherent non-uniform memory access shared-memory multiprocessor system; v) the identification of the structural differences between applications that benefit more from non-blocking synchronisation than others. We chose to examine these issues on a 64-processor SGI Origin 2000 multiprocessor system. This machine is attractive for the study because it provides an aggressive communication architecture and support for both in-cache and at-memory synchronisation primitives.
It should be clear, however, that the conclusions and the methods presented in this paper are generally applicable to other realizations of cache-coherent non-uniform memory access machines. Our results can benefit parallel programmers in two ways: first, by helping them understand the benefits of non-blocking synchronisation, and second, by helping them transform typical lock-based synchronisation operations that are probably used in their programs into non-blocking ones, using the general translations that we provide in this paper.
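
The contrast the abstract draws between lock-based and non-blocking updates can be illustrated with a minimal sketch. It is not one of the translations from the paper: the shared counter, the function names `lock_based_sum` and `lock_free_sum`, and the choice of a compare-and-swap retry loop are all assumptions made here for illustration, written with standard C++ threads rather than the Origin 2000 primitives the authors used.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Lock-based version (hypothetical example): every update serialises on one
// mutex, so a preempted lock holder stalls all threads waiting for the lock.
long lock_based_sum(int num_threads, int increments_per_thread) {
    long counter = 0;
    std::mutex counter_lock;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                std::lock_guard<std::mutex> guard(counter_lock);
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}

// Lock-free version (hypothetical example): each thread reads the counter,
// computes the new value, and publishes it with compare-and-swap. A failed
// attempt means another thread made progress, so the loop simply retries.
long lock_free_sum(int num_threads, int increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                long seen = counter.load(std::memory_order_relaxed);
                // On failure, compare_exchange_weak reloads `seen` with the
                // current value, so the next attempt uses fresh data.
                while (!counter.compare_exchange_weak(seen, seen + 1)) {
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Both functions return the same total for the same arguments, for example `lock_free_sum(4, 10000)`. The difference is that no thread in the second version can block the others by being preempted in the middle of an update. Which version is faster depends on contention and on the machine, which is the question the paper evaluates.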

Similar Papers

Evaluating the performance of non-blocking synchronization on shared-memory multiprocessors
Philippas Tsigas ... Yi Zhang
ACM SIGMETRICS Performance Evaluation Review | VOL. 29
01 Jun 2001

ASCOMA: an adaptive hybrid shared memory architecture
Chen-Chi Kuo ... M Swanson
10 Aug 1998

Performance characteristics of the SPEC OMP2001 benchmarks
Vishal Aslot ... Rudolf Eigenmann
ACM SIGARCH Computer Architecture News | VOL. 29
01 Dec 2001

A good data allocation strategy on non-uniform memory access architecture
Xiaomei Guo ... Haiyun Han
01 May 2017
