Abstract
GPU applications that follow a scatter approach suffer write contention when more than one input element atomically updates the same output element. Colliding threads are serialized, which can severely harm performance. Dealing with these issues requires a proper understanding of how scratchpad (shared) memory behaves under conflicting accesses by concurrent threads. This paper therefore presents an exhaustive microbenchmark-based analysis of atomic additions in shared memory that quantifies the impact of access conflicts on latency and throughput. This analysis has led us to uncover the lock mechanism that implements atomic updates to shared memory and to propose a performance model that estimates the latency penalties caused by position and bank conflicts. From this model we derive experiments that show how to optimize applications that use atomic operations: position conflicts can be diminished by replication, and bank conflicts by padding. The benefits of these techniques are illustrated by optimizing two widely used voting processes: the centroid-update step of k-means clustering and histogram calculation.
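As a rough illustration of the replication and padding techniques mentioned above (a sketch, not the paper's code), the following CUDA kernel computes a shared-memory histogram with R per-block sub-histograms and one padding word per replica; the names NUM_BINS, R, STRIDE, and histogram_rep_pad are hypothetical and introduced only for this example.

```cuda
// Illustrative sketch only (not the paper's implementation): a shared-memory
// histogram that combines replication and padding. NUM_BINS, R, STRIDE, and
// histogram_rep_pad are assumed names for this example.
#include <cuda_runtime.h>

#define NUM_BINS 256            // assumed number of histogram bins
#define R        8              // replicated sub-histograms per block
#define STRIDE   (NUM_BINS + 1) // +1 padding word per replica: corresponding
                                // bins of different replicas land in different banks

__global__ void histogram_rep_pad(const unsigned char *in, size_t n,
                                  unsigned int *hist)
{
    // R padded sub-histograms; thread t votes into replica t % R, so at most
    // 32/R lanes of a warp can hit the same address (position conflict).
    __shared__ unsigned int sub[R * STRIDE];

    // Cooperatively zero all replicas.
    for (int i = threadIdx.x; i < R * STRIDE; i += blockDim.x)
        sub[i] = 0;
    __syncthreads();

    const int replica = threadIdx.x % R;

    // Grid-stride loop over the input; each vote is one shared-memory atomicAdd.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        atomicAdd(&sub[replica * STRIDE + in[i]], 1u);
    __syncthreads();

    // Merge the replicas and accumulate into the global histogram.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int sum = 0;
        for (int r = 0; r < R; ++r)
            sum += sub[r * STRIDE + bin];
        atomicAdd(&hist[bin], sum);
    }
}
```

Because STRIDE = 257 is coprime with the 32 shared-memory banks, bin b of replica r maps to bank (b + r) mod 32, so lanes voting the same value through different replicas do not share a bank; whether this trade-off pays off for a given bin distribution is exactly the kind of question the paper's microbenchmarks and model are meant to answer.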