Abstract

One of the major problems with GPU on-chip shared memory is bank conflicts. We show that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth, nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These varied latencies create conflicts at the writeback stage of the in-order pipeline and cause pipeline stalls, thus degrading system throughput. Based on this observation, we propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed Elastic Pipeline, together with the co-designed bank-conflict aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for representative benchmarks, at trivial hardware overhead.

Highlights

  • Multi/many-core processors are clearly becoming pervasive computing platforms

  • We propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput

  • Our work addresses the problem of GPU on-chip shared memory bank conflicts, a problem largely orthogonal to existing GPU optimization techniques


Introduction

The trend is quite clear that multi/many-core processors are becoming pervasive computing platforms. Although GPUs were originally designed for graphics processing, the performance of many well-tuned general-purpose applications on GPUs has established them as one of the most attractive computing platforms in a more general context—leading to the GPGPU (General-purpose Processing on GPUs) domain [2]. In manycore systems such as GPUs, massive multithreading is used to hide the long latencies of the core pipeline, the interconnect, and the different memory hierarchy levels. We determine that the throughput of the GPU processor core is often hampered neither by the on-chip memory bandwidth, nor by the on-chip memory latency (as long as it stays constant), but rather by the varied latencies due to memory bank conflicts, which lead to writeback conflicts and pipeline stalls in the in-order pipeline, degrading system throughput. To address this problem, we propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput.
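To make the underlying problem concrete, the sketch below (not from the paper) models how shared memory bank conflicts serialize a warp's accesses. It assumes a hypothetical 32-bank memory with 4-byte-wide banks and a 32-lane warp, similar to common GPU designs; the bank mapping and conflict-degree calculation are illustrative simplifications.

```python
# Illustrative model of shared memory bank conflicts (assumed parameters).
NUM_BANKS = 32   # hypothetical number of banks
WORD_SIZE = 4    # bytes per bank word

def bank_of(addr):
    """Map a byte address to its bank: consecutive words hit consecutive banks."""
    return (addr // WORD_SIZE) % NUM_BANKS

def conflict_degree(addresses):
    """Serialized memory cycles for one warp access: the maximum number
    of distinct words that fall into the same bank."""
    per_bank = {}
    for addr in addresses:
        per_bank.setdefault(bank_of(addr), set()).add(addr // WORD_SIZE)
    return max((len(words) for words in per_bank.values()), default=0)

# Unit-stride access: each lane hits a different bank -> no conflict (1 cycle).
unit_stride = [4 * lane for lane in range(32)]
# Stride-2 access: lanes pair up on half the banks -> 2-way conflict (2 cycles).
stride_two = [8 * lane for lane in range(32)]

print(conflict_degree(unit_stride))  # 1
print(conflict_degree(stride_two))   # 2
```

The varying return value of `conflict_degree` is exactly the varied access latency the paper identifies: a conflicting warp occupies the memory stage for extra cycles, which in a rigid in-order pipeline backs up into writeback conflicts and stalls.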

Background and Motivation
Programming Model Properties
Baseline Manycore Barrel Processing Architecture
Shared Memory Access on GPGPU
Latency and Bandwidth Implications
Pipeline Performance Degradation Due to Bank Conflicts
Elastic Pipeline Design
Safe Scheduling Distance and Conflict Tolerance
Out-of-Order Instruction Commitment
Extension for Large Warp Size
Hardware Overhead and Impact on Pipeline Timing
Bank-Conflict Aware Warp Scheduling
Obtaining Bank Conflict Information
Bank Conflict History Cache
Proposed Warp Scheduling Scheme
Hardware Overhead
Experimental Evaluation
Effect on Pipeline Stall Reduction
Performance Improvements
Performance of Non-Conflicting Kernels
Interaction with Off-chip DRAM Access
Discussions
Related Work
Findings
Conclusions and Future Work