Buddy SM

Tao Zhang, Naifeng Jing, Kaiming Jiang, Wei Shu, Min-You Wu, Xiaoyao Liang

doi:10.1145/2744202

Abstract

A modern general-purpose graphics processing unit (GPGPU) usually consists of multiple streaming multiprocessors (SMs), each having a pipeline that incorporates a group of threads executing a common instruction flow. Although SMs are designed to work independently, we observe that they tend to exhibit very similar behavior for many workloads. If multiple SMs can be grouped and made to work in lock-step, it is possible to save energy by sharing the front-end units among them, including the instruction fetch, decode, and schedule components. However, such sharing brings architectural challenges and sometimes causes performance degradation. In this article, we present the design, implementation, and evaluation of such an architecture, which we call Buddy SM. Specifically, multiple SMs can be opportunistically grouped into a buddy cluster. One SM becomes the master, and the rest become the slaves. The front-end unit of the master works actively for itself as well as for the slaves, whereas the front-end logic of the slaves is power gated. For efficient flow control and program correctness, the proposed architecture can identify unfavorable conditions and ungroup the buddy cluster when necessary. We analyze various techniques to improve the performance and energy efficiency of Buddy SM. Detailed experiments show that Buddy SM reduces front-end energy by 37.2% and total GPU energy by 7.5%.
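
The grouping idea in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' design: the abstract does not say how buddy clusters are formed or what counts as an unfavorable condition, so the rule used here (SMs sharing the same next program counter are grouped, and a cluster is ungrouped when its members diverge) is a hypothetical stand-in. The names `SM`, `BuddyCluster`, `form_clusters`, and `active_front_ends`, as well as the `max_size` parameter, are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SM:
    """Toy streaming multiprocessor: tracks only its next program counter."""
    sm_id: int
    pc: int
    front_end_on: bool = True  # fetch/decode/schedule logic is powered


@dataclass
class BuddyCluster:
    """One master SM whose front end also drives the slave SMs."""
    master: SM
    slaves: list = field(default_factory=list)

    def group(self):
        # Slaves reuse the master's front end, so theirs is power gated.
        self.master.front_end_on = True
        for s in self.slaves:
            s.front_end_on = False

    def ungroup(self):
        # Every member falls back to its own front end.
        for s in self.slaves:
            s.front_end_on = True
        self.slaves = []

    def in_lock_step(self):
        # Assumed condition: every slave wants the master's instruction.
        return all(s.pc == self.master.pc for s in self.slaves)


def form_clusters(sms, max_size=2):
    """Greedily group SMs with the same PC (a hypothetical grouping rule)."""
    by_pc = {}
    for sm in sms:
        by_pc.setdefault(sm.pc, []).append(sm)
    clusters = []
    for members in by_pc.values():
        for i in range(0, len(members), max_size):
            chunk = members[i:i + max_size]
            cluster = BuddyCluster(master=chunk[0], slaves=chunk[1:])
            cluster.group()
            clusters.append(cluster)
    return clusters


def active_front_ends(sms):
    """Count the SMs whose front end is currently powered."""
    return sum(sm.front_end_on for sm in sms)


# Four SMs, three of which are at the same PC.
sms = [SM(0, 100), SM(1, 100), SM(2, 100), SM(3, 200)]
clusters = form_clusters(sms, max_size=2)
grouped_count = active_front_ends(sms)    # SM 1 is gated behind SM 0

# SM 1 diverges, so its cluster is no longer in lock-step and is ungrouped.
sms[1].pc = 104
for c in clusters:
    if not c.in_lock_step():
        c.ungroup()
ungrouped_count = active_front_ends(sms)  # every front end is powered again
```

In this toy run, grouping powers down one of four front ends, and the divergence of SM 1 restores all four. A real design would also have to account for the cost of sharing and of regrouping, which this sketch ignores.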

Journal: ACM Transactions on Architecture and Code Optimization
Publication Date: May 11, 2015
Citations: 5
