Atomic Operations

Abstract

Appendix B discusses the role of atomic operations in parallel computing and the functions available in CUDA. An example is provided showing how atomicCAS can be used to implement another atomic operation.
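The appendix's specific example is not reproduced here, but the pattern it refers to is well known: an atomic operation the hardware does not provide directly can be built from atomicCAS by retrying until the compare-and-swap succeeds. Below is a minimal sketch of that pattern, implementing a double-precision atomic add; the function name and the choice of operation are illustrative assumptions, not taken from the appendix.

```cuda
// Sketch: emulating an atomic double-precision add with an atomicCAS retry
// loop (the classic CAS-loop pattern; illustrative, not the appendix's code).
__device__ double atomicAddDouble(double* address, double val)
{
    // Reinterpret the 64-bit double as a 64-bit integer word, since
    // atomicCAS operates on integer types.
    unsigned long long int* address_as_ull =
        reinterpret_cast<unsigned long long int*>(address);
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        // Try to replace the value we last saw with (old value + val).
        // atomicCAS returns the word it actually found, so if another
        // thread intervened, the loop recomputes and retries.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // integer compare also terminates on NaN

    return __longlong_as_double(old);   // return the previous value, as atomics do
}
```

The same retry structure works for any read-modify-write update: read the current value, compute the new value, and attempt the swap, repeating until no other thread has changed the location in between.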

Similar Papers
  • Research Article
  • 10.18523/2617-3808.2018.33-39
Implementing HTTP-Service in a Functional Domain-Specific Language Based on Free Monads
  • Oct 16, 2018
  • NaUKMA Research Papers. Computer Science
  • Oleksii Savenkov

Implementing web services in a purely functional way is not a trivial task, because it is common to query a database (or another type of mutable data store) at all stages of server request processing. This complicates the separation of the pure functional core and the imperative shell of the program, which is the primary principle of functional software architecture. This article focuses on resolving this problem by using free monads, which make it possible to implement a new domain-specific language, defined by purely functional data structures, for describing imperative operations. Each operation is later interpreted into an actual data mutation by an interpreter. This allows defining algorithms as combinations of atomic DSL operations, which may have side effects, in a purely functional way, while only the interpretation logic of those operations remains imperative.

In contrast to the standard approach, the amount of imperative code in free-monad-based algorithms scales with the number of atomic DSL operations instead of the size of the whole program. This can make a big difference, especially if atomic operations are frequently reused. As part of the research, an HTTP-based message exchange service was implemented in its own free-monad-based domain-specific language, using the functional programming language Scala as a basis. The article shows the implementation of the server endpoint that sends a message to a specific conversation. The endpoint implementation consists of many atomic database queries and contains multiple branches of control flow.

First, each atomic database query or group of related queries is described as a pure functional record whose dynamic query parameters are defined as values of the record fields. Then, using a free monad constructor, these operations are lifted to monads and obtain their combination logic from existing monadic instances; usually one of these instances is used as the basis, since it provides a convenient way to handle exceptions. Next, atomic monadic operations are combined into algorithms. Most functional languages have syntactic sugar for writing expressions that combine multiple monads; in Scala, it is the for-expression. To interpret a DSL-based expression, an interpreter must be declared and applied. The interpreter is a function with defined behaviour for the atomic DSL operations introduced earlier; this behaviour need not be purely functional and may have side effects.

The conducted research confirms that free-monad-based functional domain-specific languages are suitable for implementing HTTP services and may significantly decrease the amount of imperative code.

  • Conference Article
  • Citations: 7
  • 10.1109/real.1996.563705
Optimizing a FIFO, scalable spin lock using consistent memory
  • Dec 4, 1996
  • I Rhee

The paper presents a FIFO-queue-based, scalable spin lock (FSSL) that: (1) uses only one atomic swap operation; (2) is scalable, as it requires a constant amount of communication; (3) runs without coherent cache support; and (4) provides a timing guarantee suitable for real-time applications. The algorithm is optimal in the number of atomic operations required to solve the scalable mutual exclusion problem in NUMA architectures, improving on T. Craig's (1993) spin lock that uses four atomic swap operations. The FSSL algorithm minimizes the number of atomic operations by replacing them with non-atomic read and write operations, and takes good advantage of recent multiprocessors where non-atomic memory operations are much more optimized than atomic operations. The algorithm runs correctly in various weakly consistent memories, providing a potentially significant speedup over algorithms with more atomic operations.

  • Book Chapter
  • Citations: 47
  • 10.1007/11516798_18
Lock-Free and Practical Doubly Linked List-Based Deques Using Single-Word Compare-and-Swap
  • Jan 1, 2005
  • Håkan Sundell + 1 more

We present an efficient and practical lock-free implementation of a concurrent deque that supports parallelism for disjoint accesses and uses atomic primitives which are available in modern computer systems. Previously known lock-free algorithms of deques are either based on non-available atomic synchronization primitives, only implement a subset of the functionality, or are not designed for disjoint accesses. Our algorithm is based on a general lock-free doubly linked list, and only requires single-word compare-and-swap atomic primitives. It also allows pointers with full precision, and thus supports dynamic deque sizes. We have performed an empirical study using full implementations of the most efficient known algorithms of lock-free deques. For systems with low concurrency, the algorithm by Michael shows the best performance. However, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and non-uniform memory architecture. In addition, the proposed solution also implements a general doubly linked list, the first lock-free implementation that only needs the single-word compare-and-swap atomic primitive.

  • Research Article
  • Citations: 54
  • 10.1016/j.jpdc.2008.03.001
Lock-free deques and doubly linked lists
  • Mar 15, 2008
  • Journal of Parallel and Distributed Computing
  • Håkan Sundell + 1 more

  • Conference Article
  • Citations: 2
  • 10.1145/2069172.2069179
Reducing biased lock revocation by learning
  • Jul 26, 2011
  • Ian Rogers + 1 more

For languages supporting concurrency, the implementation of synchronization primitives is important for achieving high performance. Many concurrent languages use object-based locks to control access to critical regions. When lock ownership doesn't change for most of its lifetime, lock biasing allows a thread to take ownership of an object so that atomic operations aren't necessary on lock entry and exit. Revoking ownership of locks biased to a thread is an expensive operation compared to the atomic operation, as the thread that holds the lock must be suspended.

When lock revocation occurs, it is common for the object being locked to be modified so that future lock attempts use atomic operations. When repeated revocations occur, the locking policy can reduce the amount of lock biasing that the system performs. Factors that can drive this include the type of the object being revoked and how recently it was allocated. The system must achieve a balance between being pessimistic about biased lock use and avoiding revocations.

This work introduces a new locking protocol where revocations can be sampled by the locker without having to bias. The mechanism provides locking information specific to a particular instance that can be used to avoid unprofitable biased-lock speculation and create a better locking policy. We demonstrate a new instance-specific locking policy implemented in the Zing Virtual Machine, an extension of the HotSpot Java Virtual Machine. We present results on how the sampling window affects the number of atomic lock operations and revocations for the SPECjvm2008 and DaCapo Bach benchmark suites.

  • Research Article
  • Citations: 49
  • 10.1145/1394608.1382154
Atomic Vector Operations on Chip Multiprocessors
  • Jun 1, 2008
  • ACM SIGARCH Computer Architecture News
  • Sanjeev Kumar + 8 more

The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be more efficiently supported than multithreading, but is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily utilize the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for these codes. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that GLSC provides an average performance improvement of 54% for 4-wide SIMD on a set of important RMS kernels.

  • Book Chapter
  • Citations: 3
  • 10.1016/b978-0-408-01171-6.50005-4
Chapter 3 - Data processing systems
  • Jan 1, 1983
  • Data Processing
  • T F Fry

  • Research Article
  • Citations: 285
  • 10.1016/j.jcp.2011.01.048
Fast analysis of molecular dynamics trajectories with graphics processing units—Radial distribution function histogramming
  • Feb 6, 2011
  • Journal of Computational Physics
  • Benjamin G Levine + 2 more

  • Book Chapter
  • Citations: 28
  • 10.1007/3-540-47993-7_6
Atomic Instructions in Java
  • Jan 1, 2002
  • David Hovemeyer + 2 more

Atomic instructions atomically access and update one or more memory locations. Because they do not incur the overhead of lock acquisition or suspend the executing thread during contention, they may allow higher levels of concurrency on multiprocessors than lock-based synchronization. Wait-free data structures are an important application of atomic instructions, and extend these performance benefits to higher-level abstractions such as queues. In type-unsafe languages such as C, atomic instructions can be expressed in terms of operations on memory addresses. However, type-safe languages such as Java do not allow manipulation of arbitrary memory locations. Adding support for atomic instructions to Java is an interesting but important challenge.

In this paper we consider several ways to support atomic instructions in Java. Each technique has advantages and disadvantages. We propose idiom recognition as the technique we feel has the best combination of expressiveness and simplicity. We describe techniques for recognizing instances of atomic operation idioms in the compiler of a Java Virtual Machine, and converting such instances into code utilizing atomic machine instructions. In addition, we describe a runtime technique which ensures that the semantics of multithreaded Java [11] are preserved when atomic instructions and blocking synchronization are used in the same program. Finally, we present benchmark results showing that for concurrent queues, a wait-free algorithm implemented using atomic compare-and-swap instructions yields better scalability on a large multiprocessor than a queue implemented with lock-based synchronization.

  • Conference Article
  • Citations: 33
  • 10.1109/cluster.2011.34
Performance Characterization and Optimization of Atomic Operations on AMD GPUs
  • Sep 1, 2011
  • Marwa Elteir + 2 more

Atomic operations are important building blocks in supporting general-purpose computing on graphics processing units (GPUs). For instance, they can be used to coordinate execution between concurrent threads and, in turn, assist in constructing complex data structures such as hash tables or implementing GPU-wide barrier synchronization. While the performance of atomic operations has improved substantially on the latest NVIDIA Fermi-based GPUs, system-provided atomic operations still incur significant performance penalties on AMD GPUs. A memory-bound kernel on an AMD GPU, for example, can suffer severe performance degradation when including an atomic operation, even if the atomic operation is never executed. In this paper, we first quantify the performance impact of atomic instructions on application kernels on AMD GPUs. We then propose a novel software-based implementation of atomic operations that can significantly improve overall kernel performance. We evaluate its performance against the system-provided atomics using two micro-benchmarks and four real applications. The results show that using our software-based atomic operations on an AMD GPU can speed up an application kernel by 67-fold over the same kernel using the (default) system-provided atomic operations.

  • Book Chapter
  • Citations: 1
  • 10.1007/978-3-319-31854-7_37
SSSP on GPU Without Atomic Operation
  • Jan 1, 2016
  • Feng Wang + 2 more

Graphs are a general theoretical model in many large-scale data-driven applications, and SSSP (Single-Source Shortest Path) is a foundation for many important algorithms and applications. GPUs remain a mainstream platform in high-performance computing with heterogeneous architectures. Because GPU threads run with a high degree of parallelism, vertex distances are updated with atomic operations to avoid read-write errors. Most of these atomic operations are unnecessary, since read-write conflicts are rare in large graphs; however, without atomic operations the accuracy of the result cannot be guaranteed, and the atomic operations take a large part of the program's running time. To improve the performance of SSSP on GPUs, we propose an algorithm that uses data-block iterations instead of atomic operations. The algorithm not only achieves a high speedup but also guarantees the accuracy of the result. Experimental results show that this SSSP algorithm gains a speedup of three times over the serial algorithm on the CPU and more than ten times over the parallel GPU algorithm with atomic operations.
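For context, the atomic baseline that this paper replaces is typically an edge-relaxation kernel in which vertex distances are updated with an atomic minimum, so that concurrent relaxations of the same vertex cannot lose updates. A generic CUDA sketch of that baseline follows; the array names and layout are assumptions for illustration and are not taken from the paper.

```cuda
#include <climits>

// Conventional atomic-based edge relaxation for SSSP (illustrative sketch).
// Each thread handles one edge (src[e] -> dst[e]) with weight[e]; distances
// are assumed to stay well below INT_MAX so the addition cannot overflow.
__global__ void relax_edges(const int* src, const int* dst, const int* weight,
                            int* dist, int num_edges)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    int du = dist[src[e]];
    if (du == INT_MAX) return;   // source endpoint not reached yet

    // Without the atomic, two threads relaxing edges into the same vertex
    // could interleave their read-modify-write and keep the larger distance;
    // atomicMin makes the update safe at the cost the paper measures.
    atomicMin(&dist[dst[e]], du + weight[e]);
}
```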

  • Conference Article
  • Citations: 5
  • 10.1109/ipdpsw.2015.40
Implementing Cross-Device Atomics in Heterogeneous Processors
  • May 1, 2015
  • Meghana Gupta + 6 more

In this paper we describe how to support atomics across multiple devices in heterogeneous processors. Specifically, this paper provides an overview of how OpenCL 2.0 and Heterogeneous System Architecture (HSA) atomics are supported on integrated CPU-GPU processors called Accelerated Processing Units (APUs). Recently, the C11 and C++11 standards have introduced atomics and an associated memory model for supporting scalable parallel programming with memory consistency semantics. The OpenCL 2.0 revision has extended these atomics to multiple devices, each of which can be a CPU or a GPU. The HSA Foundation has also included support in the HSA intermediate language (HSAIL) standard for various atomic operations that span multiple devices. All of these paradigms enable parallel threads running simultaneously on CPU and GPU cores to synchronize using atomics, which was not possible earlier. In APUs, the CPU and GPU cores are on the same die and can access a unified memory. Hence, such a platform provides an excellent opportunity for showcasing the power of OpenCL 2.0/HSA atomics across devices (henceforth referred to as cross-device atomics). In this work we show how we have added capabilities to our LLVM-based OpenCL compiler and a JIT-like finalizer to support cross-device atomics for APUs. Also, by supporting the new HSAIL atomic virtual operations in our finalizer, we have enabled other high-level languages which translate to HSAIL to support cross-device atomics as part of their evolving language standards. Our compiler is one of the first to support such cross-device atomics.

  • Conference Article
  • Citations: 3
  • 10.1145/3422575.3422789
CircusTent: A Benchmark Suite for Atomic Memory Operations
  • Sep 28, 2020
  • Brody Williams + 4 more

A paradigm shift is currently taking place in the field of computer architecture. Consistent silicon-level processor improvements, relied upon in the past to drive the augmentation of system scalability, have stalled. As such, it is widely believed that future systems, wherein the design of hardware and software are more closely coupled, will need to leverage an increased degree of heterogeneity in order to realize further improvements. Parallel processing and corresponding programming models, already ubiquitous to high performance computing, will play a crucial role in these systems. Consequently, it is critically important to understand the interaction between these components. However, the behavior of atomic operations and associated synchronization primitives, which already represent a bottleneck in current systems, is difficult to quantify. Therefore, in this work, we introduce CircusTent, an open source, modular, and natively extensible benchmark suite for shared and distributed memory systems that is designed to measure the performance of a target architecture’s memory subsystem with respect to atomic operations. Herein, we first detail the design of CircusTent, which includes eight different kernels designed to replicate common atomic memory access patterns using two atomic primitives. We then demonstrate the capabilities of CircusTent through an evaluation of fourteen different platforms using our OpenMP benchmark implementation. In short, we believe CircusTent will prove to be an invaluable tool for the design and prototyping of emerging architectures.

  • Conference Article
  • 10.1145/2578948.2560697
Programming a Multicore Architecture without Coherency and Atomic Operations
  • Feb 7, 2014
  • Jochem H Rutgers + 2 more

It is hard to reason about the state of a multicore system-on-chip, because operations on memory need multiple cycles to complete, since cores communicate via an interconnect like a network-on-chip. To simplify programming, atomicity is required, by means of atomic read-modify-write (RMW) operations, a strong memory model, and hardware cache coherency. As a result, multicore architectures are very complex, but this stems from the fact that they are designed with an imperative programming paradigm in mind, i.e. based on threads that communicate via shared memory.

In this paper, we show the impact on a multicore architecture when the programming paradigm is changed and a λ-calculus-based (functional) language is used instead. Ordering requirements of memory operations are more relaxed and synchronization is simplified, because λ-calculus does not have a notion of state or memory, and therefore does not impose ordering requirements on the platform. We implemented a functional language for multicores with a weak memory model, without the need for hardware cache coherency, any atomic RMW operation, or mutexes; the execution is atomic-free. Experiments show that even on a system with (transparently applied) software cache coherency, execution scales properly up to 32 cores. This shows that concurrent hardware complexity can be reduced by making different choices in the software layers on top.

  • Research Article
  • Citations: 3
  • 10.3765/amp.v5i0.4232
No Metathesis in Harmonic Serialism
  • Feb 10, 2018
  • Proceedings of the Annual Meetings on Phonology
  • Chikako Takahashi

This paper presents a Harmonic Serialism analysis of synchronic metathesis, proposing to eliminate metathesis as an atomic operation and instead analyzing apparent metathesis cases as the result of the sequential application of simpler operations such as copy + deletion or fusion + fission, rather than as segment reordering. The analysis of Rotuman phase alternation in this paper offers a unified account of apparent metathesis, deletion, and umlaut as all going through the processes of copy + deletion and subsequent fusion. Balangao CC metathesis is analyzed as fusion + fission, incorporating the idea that CC metathesis is phonetically motivated. Removing the atomic metathesis operation has several benefits: (a) it simplifies the inventory of operations in Harmonic Serialism, (b) it correctly predicts locality restrictions on metathesis patterns without the help of other constraints that are otherwise needed in HS analyses, and (c) it correctly predicts the typological restrictions on the types of segments that undergo CC metathesis.
