Warp Scheduling Research Articles

General-purpose computing on graphics processing units (GPGPUs) is an attractive option for accelerating applications with massively data-parallel tasks. While the performance of modern GPGPUs is increasing rapidly, the power consumption of these devices is becoming a major concern. In particular, the execution units and the register file are among the top three most power-hungry components in GPGPUs. In this work, we exploit trivial instructions to reduce power consumption in GPGPUs. Trivial instructions are instructions that do not need computation, e.g., multiplication by one. We found that, during the course of a program's execution, a GPGPU executes many trivial instructions, and executing them wastes power. We propose trivial bypassing, which skips the execution of trivial instructions and avoids allocating resources for them. By power gating execution units and skipping trivial computations, trivial bypassing reduces both static and dynamic power. It also reduces the dynamic energy of the register file by avoiding register file accesses for the source and/or destination operands of trivial instructions. While trivial bypassing reduces the energy of GPGPUs, it has a detrimental impact on performance, because a power-gated execution unit requires several cycles to resume normal operation. Conventional warp schedulers are oblivious to the status of execution units. We propose a new warp scheduler that prioritizes warps based on the availability of execution units. We also propose a set of new power management techniques to further reduce the performance penalty of power gating. To increase the energy savings of trivial bypassing, we also propose approximating the operands of instructions. We offer a set of new techniques to approximate both integer and floating-point instructions and increase the pool of trivial instructions. Our evaluations using a diverse set of benchmarks reveal that our proposed techniques reduce the energy of execution units by 11.2% and the dynamic energy of the register file by 12.2% with minimal performance and quality degradation.
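The idea of detecting a trivial instruction, optionally after approximating its operands, can be illustrated with a small software sketch. This is not the hardware mechanism from the paper: the `Instr` type, the opcode names, the `tol` parameter, and the add-with-zero case are all assumptions made for illustration; the abstract names only multiplication by one as an example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    """Hypothetical two-operand instruction (names are illustrative)."""
    op: str   # "mul" or "add"
    a: float
    b: float


def trivial_result(instr: Instr, tol: float = 0.0) -> Optional[float]:
    """Return the bypassed result if the instruction is trivial, else None.

    tol = 0.0 models exact trivial detection. tol > 0.0 models operand
    approximation (an assumed scheme): an operand within tol of the
    identity value is treated as the identity, which enlarges the pool
    of trivial instructions at some cost in output quality.
    """
    def near(x: float, target: float) -> bool:
        return abs(x - target) <= tol

    if instr.op == "mul":
        # x * 1 == x: the other operand can be forwarded unchanged.
        if near(instr.a, 1.0):
            return instr.b
        if near(instr.b, 1.0):
            return instr.a
    elif instr.op == "add":
        # x + 0 == x (assumed additional trivial case).
        if near(instr.a, 0.0):
            return instr.b
        if near(instr.b, 0.0):
            return instr.a
    return None  # not trivial: the execution unit must do the work
```

With `tol=0.0`, `Instr("mul", 1.001, 4.0)` is not trivial; with `tol=0.01` it is bypassed and yields `4.0`, showing how approximation trades accuracy for more bypass opportunities.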

The massively parallel architecture of graphics processing units (GPUs) enables them to boost performance for a wide range of applications. Initially, GPUs employed only scratchpad memory as on-chip memory. Recently, to broaden the scope of applications that GPUs can accelerate, GPU vendors have added caches as on-chip memory in new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise from excessive thread contention for cache resources. Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help mitigate the cache resource contention problem. In this paper, we propose coordinated static and dynamic cache bypassing to improve GPU application performance. At compile time, we identify the global loads that indicate strong preferences for caching or bypassing and encode the classification into the application binary. For the remaining global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. In addition to coordinated bypassing, we also develop a bypass-aware warp scheduler that adaptively adjusts the scheduling policy based on cache performance. Evaluations show that our coordinated static and dynamic cache bypassing technique achieves up to 2.28× (average 1.32×) performance speedup for a variety of GPU applications. When we combine the coordinated cache bypassing with the bypass-aware scheduler, the average speedup is further improved to 1.38×.
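The coordination between the compile-time classification and the run-time decision can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' design: the `LoadClass` names, the `thread_slot` index, and the rule that the first `cached_slots` slots use the cache are all hypothetical, and the abstract does not say how the cached fraction is chosen.

```python
from enum import Enum


class LoadClass(Enum):
    """Hypothetical per-load tag encoded into the binary at compile time."""
    CACHE = "cache"      # strong preference for caching
    BYPASS = "bypass"    # strong preference for bypassing
    DYNAMIC = "dynamic"  # no strong preference: decided at run time


def uses_cache(load_class: LoadClass, thread_slot: int, cached_slots: int) -> bool:
    """Decide whether a global load goes through the cache.

    Statically classified loads follow their tag. For the remaining
    loads, only a fraction of threads is cached; here that fraction is
    modeled (as an assumption) by caching slots 0..cached_slots-1.
    """
    if load_class is LoadClass.CACHE:
        return True
    if load_class is LoadClass.BYPASS:
        return False
    return thread_slot < cached_slots
```

For example, with `cached_slots=2`, dynamic loads from slots 0 and 1 are cached while slots 2 and 3 bypass the cache; a run-time policy could raise or lower `cached_slots` as contention changes.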

Articles published on Warp Scheduling

WSMP: a warp scheduling strategy based on MFQ and PPF

An Effective Method to Identify Microarchitectural Vulnerabilities in GPUs

Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU

Exit CTA Conscious Warp Scheduling

Architecture exploration of recent GPUs to analyze the efficiency of hardware resources

Reducing Energy in GPGPUs through Approximate Trivial Bypassing

Coordinated thread block scheduling and warp scheduler for workload distribution

Fair and cache blocking aware warp scheduling for concurrent kernel execution on GPU

An On-Line Testing Technique for the Scheduler Memory of a GPGPU

A novel warp scheduling scheme considering long-latency operations for high-performance GPUs

FRF: Toward Warp-Scheduler Friendly STT-RAM/SRAM Fine-Grained Hybrid GPGPU Register File Design

Exploring Warp Criticality in Near-Threshold GPGPU Applications Using a Dynamic Choke Point Analysis

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

Adaptive Cooperation of Prefetching and Warp Scheduling on GPUs

Memory Request Priority Based Warp Scheduling for GPUs

Optimizing Cache Bypassing and Warp Scheduling for GPUs

CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs

A Formalized Method for Designing Applications in GPGPU Technology (Формалізований метод проектування застосувань в технології GPGPU)

Dynamic Resizing on Active Warps Scheduler to Hide Operation Stalls on GPUs

An Energy-Efficient GPGPU Register File Architecture Using Racetrack Memory
