TELEPORT: Hardware/software alternative to CUDA shared memory programming

Ahmad Lashgar, Ehsan Atoofian, Amirali Baniasadi

doi:10.1016/j.micpro.2018.09.004

Abstract

Using software-managed cache in CUDA programming provides significant potential to improve memory efficiency. Employing this feature requires the programmer to identify data tiles associated with thread blocks and bring them to the cache explicitly. Despite the advantages, the development effort required to exploit this feature can be significant. The goal of this paper is to reduce this effort while maintaining the associated benefits. To this end, we first investigate static precalculability in memory accesses for GPGPU workloads, at the thread block granularity. We show that a significant share of addresses can be precalculated knowing thread block identifiers. We build on this observation and introduce TELEPORT. TELEPORT is a novel hardware/software scheme for delivering performance competitive with software-managed cache programming, but at no extra development effort. On the software side, TELEPORT’s static analyzer parses the kernel and finds precalculable memory accesses. We introduce Runtime API calls to pass this information to hardware. On the hardware side, this information is used to fetch the data required for each thread block into shared memory before the thread block starts execution. With this hardware support, TELEPORT outperforms hand-written CUDA code as a result of the associated DRAM row locality improvement. Investigating a wide set of benchmarks, we show that TELEPORT improves performance of hand-written implementations, on average, by 32% while reducing development effort by 2.5×. Our estimations show that the hardware overhead associated with TELEPORT is below 1%.
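
To illustrate what precalculability at thread block granularity could look like, the following is a minimal Python sketch, not the paper's static analyzer. It assumes a hypothetical 1-D kernel whose only global access is `a[blockIdx.x * blockDim.x + threadIdx.x]`; the function names, the `base` address, and the 4-byte element size are all assumptions made for illustration. Under that assumption, the byte range a thread block touches depends only on the block identifier and block size, so it can be computed before any thread of the block runs.

```python
def thread_address(block_idx, block_dim, thread_idx, base=0, elem_size=4):
    """Byte address read by one thread for the assumed access pattern
    a[blockIdx.x * blockDim.x + threadIdx.x]."""
    return base + (block_idx * block_dim + thread_idx) * elem_size


def block_tile_range(block_idx, block_dim, base=0, elem_size=4):
    """Half-open byte range [start, end) covering every address the block
    touches, computed from the block identifier alone (no threadIdx needed)."""
    first_elem = block_idx * block_dim      # element read by threadIdx.x == 0
    end_elem = first_elem + block_dim       # one past the last thread's element
    return base + first_elem * elem_size, base + end_elem * elem_size


if __name__ == "__main__":
    # Hypothetical launch: blocks of 256 threads, 4-byte elements.
    start, end = block_tile_range(block_idx=2, block_dim=256)
    print(f"block 2 touches bytes [{start}, {end})")
```

In this toy setting, a prefetcher that knows `block_idx` and `block_dim` could stage the whole tile ahead of the block's execution, which is the kind of opportunity the abstract describes.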

Journal: Microprocessors and Microsystems
Publication Date: Sep 14, 2018
Citations: 1

Similar Papers

Shared memory multiplexing
Yi Yang ... Huiyang Zhou
19 Sep 2012

Static code transformations for thread‐dense memory accesses in GPU computing
Hyunjun Kim ... Jeonghwan Park
Concurrency and Computation: Practice and Experience | VOL. 32
18 Oct 2019

Efficient automatic parallelization of a single GPU program for a multiple GPU system
Matam Kiran Kumar ... Murali Annavaram
Integration | VOL. 66
07 Jan 2019

OSM: Off-Chip Shared Memory for GPUs
Sina Darabi ... Pejman Lotfi-Kamran
IEEE Transactions on Parallel and Distributed Systems | VOL. 33
01 Dec 2022
